1. The Physics of Information




Information is carried, stored, retrieved and processed by machines, whether 
they be electronic computers or living organisms. All information, which in an 
abstract sense one may think of as a string of zeros and ones, has to be carried by a 
physical substrate, be it paper, silicon chips or holograms, and the handling of this 
information is physical, so information is ultimately constrained by the fundamental 
laws of physics. It is therefore not surprising that physics and information share a 
rich interface. 

The notion of information as used by Shannon is a generalization of the notion 
of entropy, which first appeared in thermodynamics. In thermodynamics entropy 
is an abstract quantity depending on heat and temperature whose interpretation is 
not obvious. This changed with the theory of statistical mechanics, which explains 
and generalizes thermodynamics. Statistical mechanics exploits a decomposition 
of a system into microscopic units such as atoms to explain macroscopic phenom- 
ena such as temperature and pressure in terms of the statistical properties of the 
microscopic units. Statistical mechanics makes it clear that entropy can be re- 
garded as a measure of microscopic disorder. The entropy S can be written as 
$S = -\sum_i p_i \log p_i$, where $p_i$ is the probability of a particular microscopic state, for
example the likelihood that a given atom will have its velocity and position within 
a given range. 

Shannon realized that entropy is useful to describe disorder in much more gen- 
eral settings, which might have nothing to do with atoms or physics. The entropy 
of a probability distribution $\{p_i\}$ is well defined as long as $p_i$ is well defined. In this
more general context he argued that measuring order and measuring disorder are 
essentially the same - in a situation that is highly disordered, making a measure- 
ment gives a great deal of information, and conversely, in a situation that is highly 
ordered, making a measurement gives little information. Thus for a system that can 
randomly be in one of several different states the entropy of its distribution is the 
same as the information gained by knowing which state i it is in. It turns out that 
the concept of entropy, or equivalently information, is useful in many applications
that have nothing to do with physics. 
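
To make the link between disorder and information concrete, here is a minimal sketch that evaluates $-\sum_i p_i \log p_i$ for two four-state distributions. The function name `shannon_entropy`, the choice of base 2 (bits), and the two example distributions are assumptions made for illustration; they do not come from the text.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Return -sum(p * log(p)) for a discrete distribution, in units set by `base`."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # States with p == 0 contribute nothing, since p * log(p) -> 0 as p -> 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # highly disordered: four equally likely states
peaked = [0.97, 0.01, 0.01, 0.01]   # highly ordered: one state dominates

print("uniform:", shannon_entropy(uniform), "bits")
print("peaked: ", shannon_entropy(peaked), "bits")
```

The uniform case gives two bits, the information needed to single out one of four equally likely states, while the peaked case gives much less: learning the state of an ordered system tells us little we did not already expect.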

It also turns out that thinking in these more general terms is useful for physics.
For example, Shannon's work makes it clear that entropy is in some sense more 
fundamental than the quantities from which it was originally derived. This led 
Jaynes to formulate all of statistical mechanics as a problem of maximizing entropy. 
In fact, all of science can be viewed as an application of the principle of maximum 
entropy, which provides a means of quantifying the tradeoff between simplicity and 
accuracy of description. If we want to understand how physical systems can be 
used to perform computations, or construct computer memories, it can be useful 
to define entropies that may not correspond to thermodynamic entropy. But if 
we want to understand the limits to computation it is very useful to think in 
thermodynamic or statistical terms. This has become particularly important in 
efforts to understand how to take advantage of quantum mechanics to improve 
computation. These considerations have given rise to a subfield of physics that is 
often called the physics of information. 

In this chapter we attempt to explain to a non-physicist where the idea of infor- 
mation came from. We begin by describing the origin of the concept of entropy in 
thermodynamics, where entropy is just a macroscopic state variable related to heat 
flow and temperature, a rather mathematical device without a concrete physical 
interpretation. We then discuss how the microscopic theory of atoms led to sta- 
tistical mechanics, which makes it possible to derive and extend thermodynamics. 
This led to the definition of entropy in terms of probabilities on the set of accessible 
microscopic states of a system and provided the inspiration for modern information 
theory starting with the seminal work of Shannon [60]. A close examination of the
foundations of statistical mechanics and the need to reconcile the probabilistic and 
deterministic views of the world leads us to a discussion of chaotic dynamics, where 
information plays a crucial role in quantifying predictability. We then discuss a va- 
riety of fundamental issues that emerge in defining information and how one must 
exercise care in discussing concepts such as order, disorder, and incomplete knowl- 
edge. We also discuss an alternative form of entropy and its possible relevance for 
nonequilibrium thermodynamics. 

Toward the end of the chapter we discuss how quantum mechanics gives rise 
to the concept of quantum information. Entirely new possibilities for information 
storage and computation are possible due to the massive parallel processing inherent 
in quantum mechanics. We also point out how entropy can be extended to apply to 
quantum mechanics to provide a useful measurement for quantum entanglement. 
Finally we make a small excursion to the interface between quantum theory and
general relativity, where one is confronted with the "ultimate information paradox" 
posed by the physics of Black Holes. In this review we have limited ourselves; not 
all relevant topics that touch on physics and information have been covered. 

In our quest for more and more volume and speed in storing and processing 
information we are naturally led to the smallest scales we can physically manipulate. 
We began the introduction by quoting Feynman's visionary 1959 lecture "Plenty of 
room at the bottom" [25], where he discusses storing and manipulating information
on the atomic level. Currently commercially available processors work at scales of
60 nm (1 nm = 1 nanometer = $10^{-9}$ meter). In 2006, IBM announced circuitry on
a 30 nm scale, which indeed makes it possible to write the Encyclopedia Britannica
on the head of a pin, so Feynman's speculative remark in 1959 is now just a marker
of the current scale of computation. To make it clear how close this is to the atomic
scale, a square with sides of length 30 nm contains about 1000 atoms. Under the
historical pattern of Moore's law, integrated circuitry halves in size every 2 years. 
If we continue on the same trajectory of improvement, within about 20 years the 
components will be the size of individual atoms, and it is difficult to imagine that 
computers will be able to get any smaller. Once this occurs information at the 
atomic scale will be directly connected to our use of information on a macroscopic 
scale. There is a certain poetry to this: Once a computer has components on a 
quantum scale, the motion of its atoms will no longer be random, and in a certain 
sense will not be described by classical statistical mechanics, at the same time that 
it will be used to process information on a macroscopic scale. 
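
The 20-year estimate can be checked with one line of arithmetic. The sketch below assumes (this reading is ours, the text does not spell it out) that "halves in size" refers to the area of a component, so that the number of atoms per component halves every two years, starting from the roughly 1000 atoms quoted above:

$$1000 \times 2^{-n} \approx 1 \quad \Longrightarrow \quad n = \log_2 1000 \approx 10 \text{ halvings}, \qquad t \approx 10 \times 2 \text{ years} = 20 \text{ years}.$$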

2. Thermodynamics 

The truth of the second law is, therefore, a statistical and not a mathematical 
truth, for it depends on the fact that the bodies we deal with consist of millions 
of molecules and that we never can get a hold of single molecules 

J.C. Maxwell 

Thermodynamics is the study of macroscopic physical systems.^1 These systems
contain a large number of degrees of freedom, typically of the order of Avogadro's
number, i.e. $N_A \sim 10^{23}$. The three laws of thermodynamics describe processes
in which systems exchange energy with each other or with their environment. For 
example, the system may do work, or exchange heat or mass through a diffusive 
process. A key idea is that of equilibrium, which in thermodynamics is the as- 
sumption that the exchange of energy or mass between two systems is the same in 
both directions; this is typically only achieved when two systems are left alone for 
a long period of time. A process is quasistatic if it always remains close to
equilibrium, which also implies that it is reversible, i.e. that the process can be undone
and the system can return to its original state without any external energy inputs. 
We distinguish various types of processes, for example an isothermal process in 
which the system is in thermal contact with a reservoir that keeps it at a fixed 
temperature. Another example is an adiabatic process in which the system is kept 
thermally isolated and the temperature is allowed to change. A system may also 
go from one equilibrium state to another via a nonequilibrium process, such as the 
free expansion of a gas or the mixing of two fluids, in which case it is not reversible. 
No real system is fully reversible, but it is nonetheless a very useful concept. 

The remarkable property of systems in equilibrium is that the macro states 
can be characterized by only very few variables, such as the volume $V$, pressure $P$,
temperature $T$, entropy $S$, chemical potential $\mu$ and particle number $N$. These state
variables are in general not independent, but rather are linked by an equation of
state, which describes the constraints imposed by physics. A familiar example is the
ideal gas law $PV = NkT$, where $k$ is the Boltzmann constant relating temperature
to energy ($k = 1.4 \times 10^{-23}$ joule/kelvin). In general the state variables come in
pairs, one of which is intensive while the other conjugate variable is extensive.
Intensive variables like pressure or temperature are independent of system size,
while extensive variables like volume and entropy are proportional to system size.
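
As a small numerical illustration of the equation of state and of the intensive/extensive distinction, the sketch below evaluates $P = NkT/V$ for one mole of gas and then doubles both $N$ and $V$. The function name and the sample values (273.15 K, 22.4 litres) are assumptions chosen for the example.

```python
K_BOLTZMANN = 1.380649e-23  # J/K (SI defining value)
N_AVOGADRO = 6.02214076e23  # particles per mole (SI defining value)

def ideal_gas_pressure(n_particles, volume_m3, temperature_k):
    """Pressure in pascal from the ideal gas law P V = N k T."""
    if volume_m3 <= 0 or temperature_k <= 0:
        raise ValueError("volume and temperature must be positive")
    return n_particles * K_BOLTZMANN * temperature_k / volume_m3

# One mole at 273.15 K in 22.4 litres: roughly atmospheric pressure.
p_one = ideal_gas_pressure(N_AVOGADRO, 22.4e-3, 273.15)

# Doubling the extensive variables N and V leaves the intensive variable P unchanged.
p_two = ideal_gas_pressure(2 * N_AVOGADRO, 2 * 22.4e-3, 273.15)

print(f"P (1 mol, 22.4 L): {p_one:.0f} Pa")
print(f"P (2 mol, 44.8 L): {p_two:.0f} Pa")
```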

In this lightning review we will only highlight the essential features of thermo- 
dynamics that are most relevant in connection with information theory. 

^1 Many details of this brief exposé of selected items from thermodynamics and statistical
mechanics can be found in standard textbooks on these subjects [57, 43, 36, 47].

2.1. The laws. The first law of thermodynamics reads^2

$$dU = \text{đ}Q - \text{đ}W \qquad (2.1)$$

and amounts to the statement that heat is a form of energy and that energy is
conserved. More precisely, the change in internal energy $dU$ equals the amount of
heat $\text{đ}Q$ absorbed by the system minus the work done by the system, $\text{đ}W$.

The second law introduces the concept of entropy S, which is defined as the 
ratio of heat flow to temperature. The law states that the entropy for a closed 
system (with constant energy, volume and number of particles) can never decrease. 
In mathematical terms 

$$dS = \frac{\text{đ}Q}{T}, \qquad \frac{dS}{dt} \geq 0. \qquad (2.2)$$

By using a gas as the canonical example, we can rewrite the first law in proper 
differentials as 

dU = TdS - PdV, (2.3) 
where PdV is the work done by changing the volume of the container, for example 
by compressing the gas with a piston. It follows from the relation between entropy, 
heat and temperature that entropy differences can be measured by measuring the 
temperature with a thermometer and the change in heat with a calorimeter. This 
illustrates that from the point of view of thermodynamics entropy is a purely macro- 
scopic quantity. 
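
As a rough illustration of measuring an entropy difference with a thermometer and a calorimeter, the sketch below integrates $dS = \text{đ}Q/T$ for a body that is heated reversibly and has a constant heat capacity $C$, which gives $\Delta S = C \ln(T_{\text{final}}/T_{\text{initial}})$. The function name, the choice of water, and the specific heat of about 4186 J/(kg K) are assumptions made for the example, not part of the text.

```python
import math

def entropy_change_constant_heat_capacity(heat_capacity, t_initial, t_final):
    """Entropy change (J/K) for reversible heating with constant heat capacity (J/K).

    Integrates dS = dQ / T with dQ = C dT, giving C * ln(t_final / t_initial).
    Temperatures are absolute temperatures in kelvin.
    """
    if t_initial <= 0 or t_final <= 0:
        raise ValueError("temperatures must be positive (kelvin)")
    return heat_capacity * math.log(t_final / t_initial)

# Assumed example: 1 kg of water, specific heat taken as 4186 J/(kg K),
# heated from 293.15 K to 353.15 K.
water_heat_capacity = 1.0 * 4186.0
delta_s = entropy_change_constant_heat_capacity(water_heat_capacity, 293.15, 353.15)
print(f"Entropy change: {delta_s:.1f} J/K")
```

Only macroscopic readings (temperatures and the heat capacity obtained from calorimetry) enter the calculation, which is the sense in which thermodynamic entropy is a purely macroscopic quantity.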
Figure 1. The relation between heat and work illustrating the
two formulations of the second law of thermodynamics. On the 
left we have the Kelvin formulation. The ideal engine corresponds 
to the diagram with the black arrows only. The second law tells us 
that the third, grey arrow is necessarily there. The right picture 
with only the black arrows corresponds to the ideal refrigerator, 
and the third, grey arrow is again required by the second law. 

^2 The bars through the differentials indicate that the quantities following them are not state
variables: the d-bars therefore refer to small quantities rather than proper differentials.

There are two different formulations of the second law. The Kelvin formulation
states that it is impossible to have a machine whose sole effect is to convert heat
into work. We can use heat to do work, but to do so we must inevitably make other
alterations, e.g. letting heat flow from hot to cold and thereby bringing the system 
closer to equilibrium. Clausius' formulation says that it is impossible to have a 
machine that only extracts heat from a reservoir at low temperature and delivers 
that same amount of heat to a reservoir at higher temperature. Rephrasing these 
formulations, Kelvin says that ideal engines cannot exist and Clausius says that 
ideal refrigerators cannot exist. See figure 1.

The action of a heat engine or refrigerator can be pictured in a diagram
in which the reversible sequence of states the system goes through forms a closed curve,
called a Carnot cycle. We give an example for the Kelvin formulation in figure 2.
Imagine a piston in a chamber; our goal is to use the temperature differential
between two reservoirs to do work. The cycle consists of four steps. In step $a \to b$,
isothermal expansion, the system absorbs an amount $Q_1$ of heat from the reservoir
at high temperature $T_1$, which causes the gas to expand and push on the piston,
doing work. In step $b \to c$, adiabatic expansion, the gas continues to expand and
do work, but the chamber is detached from the reservoir, so that it no longer
absorbs any heat. Now as the gas expands it cools until it reaches temperature $T_2$.
In step $c \to d$, isothermal compression, the surroundings do work on the gas, as
heat flows into the cooler reservoir, giving off an amount $Q_2$ of heat. In step
$d \to a$, adiabatic compression, the surroundings continue to do work, as the gas is
further compressed (without any heat transfer) and brought back up to the original
temperature. The net work done by the machine is given by the line integral:

$$W = \oint_{\text{cycle}} P\,dV = \text{enclosed area} \qquad (2.4)$$

which by the first law should also be equal to $W = Q_1 - Q_2$ because the internal
energy is the same at the beginning and end of the cycle. We also can calculate the 
total net change in entropy of the two reservoirs as 

$$\Delta S = -\frac{Q_1}{T_1} + \frac{Q_2}{T_2} \geq 0, \qquad (2.5)$$

where the last inequality has to hold because of the second law. Note that the two 
latter equations can have solutions with positive $W$. The efficiency of the engine $\eta$
is by definition the ratio of the work done to the heat entering the system, or

$$\eta = \frac{W}{Q_1} = 1 - \frac{Q_2}{Q_1} \leq 1 - \frac{T_2}{T_1}. \qquad (2.6)$$

This equals one for an ideal heat engine, but is less than one for a real engine.
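
The bookkeeping of the cycle can be sketched numerically. The code below is a minimal illustration with assumed numbers (1000 J absorbed at 500 K, heat rejected at 300 K); the function names are ours. It treats the reversible case, in which the entropy change of the two reservoirs vanishes, and contrasts it with an engine that rejects more heat than the reversible minimum.

```python
def reversible_cycle(q1, t1, t2):
    """Reversible cycle between reservoirs at t1 > t2 (kelvin); q1 is heat absorbed (J).

    Returns (q2, work, efficiency), where q2 is the heat given off to the cold reservoir.
    """
    if not (t1 > t2 > 0):
        raise ValueError("need t1 > t2 > 0")
    q2 = q1 * t2 / t1   # reversible case: -q1/t1 + q2/t2 = 0
    work = q1 - q2      # first law: internal energy returns to its starting value
    return q2, work, work / q1

def reservoir_entropy_change(q1, q2, t1, t2):
    """Net entropy change of the two reservoirs over one cycle."""
    return -q1 / t1 + q2 / t2

q1, t1, t2 = 1000.0, 500.0, 300.0
q2_rev, w_rev, eta_rev = reversible_cycle(q1, t1, t2)

# A less efficient engine gives off more heat to the cold reservoir (assumed: 700 J).
q2_irr = 700.0
eta_irr = (q1 - q2_irr) / q1

print("reversible:   eta =", eta_rev, " dS =", reservoir_entropy_change(q1, q2_rev, t1, t2))
print("irreversible: eta =", eta_irr, " dS =", reservoir_entropy_change(q1, q2_irr, t1, t2))
```

In this example the reversible efficiency is $1 - T_2/T_1 = 0.4$, and any engine that produces entropy in the reservoirs falls below that bound.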

A modern formulation of the second law, which in the setting of statistical me- 
chanics is equivalent to the statements of Kelvin and Clausius, is the Landauer 
principle, which says that there is no machine whose sole effect is the erasure of 
information. There is a price to forgetting: The principle states that the erasure 
of information (which is irreversible) is inevitably accompanied by the generation 
of heat. In other words, logical irreversibility necessarily involves thermodynamical 
irreversibility. One has to generate at least $kT \ln 2$ of heat to get rid of one bit of
information [44, 45]. We return to the Landauer principle in the section on Statistical
mechanics. 
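
To give a sense of scale, the sketch below evaluates the bound $kT \ln 2$ at an assumed room temperature of 300 K. The function name and the temperature are illustrative choices.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K (SI defining value)

def landauer_limit(temperature_k, bits=1):
    """Minimum heat (joules) generated when erasing `bits` bits at temperature T."""
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return bits * K_BOLTZMANN * temperature_k * math.log(2)

print(f"Per bit at 300 K: {landauer_limit(300.0):.3e} J")
print(f"Per gigabyte (8e9 bits) at 300 K: {landauer_limit(300.0, bits=8e9):.3e} J")
```

The result, about $3 \times 10^{-21}$ J per bit, is a lower bound set by thermodynamics and says nothing about how close any particular device comes to it.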

Figure 2. The Carnot cycle corresponding to the Kelvin formulation
of the second law. The work done by the engine equals the
line integral along the closed contour and is therefore equal to the
enclosed area.

We just showed that the second law sets fundamental limits on the possible
efficiency of real machines like steam engines, refrigerators and information processing
devices. As everybody knows, real engines give off heat and real refrigerators and
real computers need power to do their job. The second law tells us to what
extent heat can be used to perform work. The increase of entropy as we go from
one equilibrium situation to another is related to dissipation and the production of 
heat, which is intimately linked to the important notion of irreversibility. A given 
action in a closed system is irreversible if it makes it impossible for the system to 
return to the state it was in before the action took place without external inputs. 
Irreversibility is always associated with production of heat, because heat cannot be 
freely converted to other forms of energy (whereas any other form of energy can 
always be converted to heat). One can decrease the entropy of a system by doing 
work on it, but in doing the work one has to increase the entropy of another system 
(or of the system's environment) by an equal or greater amount. 

The theory of thermodynamics taken by itself does not connect entropy with 
information. This only comes about when the results are interpreted in terms of a 
microscopic theory, in which case temperature can be interpreted as being related to 
uncertainty and incoherence in the position of particles. This requires a discussion 
of statistical mechanics, as done in the next section. 

There is another fundamental aspect to the second law which is important from 
an operational as well as philosophical point of view. A profound implication of the 
second law is that it defines an "arrow of time" , i.e., it allows us to distinguish the 
past from the future. This is in contrast to the fundamental microscopic laws of 
physics which are time reversal invariant (except for a few exotic interactions, that 
are only very rarely seen under normal conditions as we find them on earth). If one 
watches a movie of fundamental processes on the microscopic level it is impossible 
to tell whether it is running forwards or backwards. In contrast, if we watch a 
movie of macroscopic events, it is not hard to identify irreversible actions such as 
the curling of smoke, the spilling of a glass of water, or the mixing of bread dough, 
which easily allow us to determine whether we are running in forward or reverse. 
More formally, even if we didn't know which way time were running, we could 
pick out some systems at random and measure their entropy at times $t_1, t_2, \ldots$ The
direction in which entropy increases is the one that is going forward in time. Note
that we didn't define an a priori direction of time in formulating the second law - 
it establishes a time direction on its own, without any reference to atomic theory 
or any other laws of physics. 

The second law of thermodynamics talks only about the difference between the 
entropy of different macrostates. The absolute scale for entropy is provided by the 
third law of thermodynamics. This law states that when a system approaches the 
absolute zero of temperature the entropy will go to zero, i.e. 

$$T \to 0 \quad \Rightarrow \quad S \to 0. \qquad (2.7)$$

When $T = 0$ the heat is zero, corresponding classically to no atomic motion, and
the energy takes on its lowest possible value. In quantum theory we know that such 
a lowest energy "ground" state also exists, though, if the ground state of the system 
turns out to be degenerate the entropy will approach a nonzero constant at zero 
temperature. We conclude by emphasizing that the laws of thermodynamics have 
a wide applicability and a rich phenomenology that supports them unequivocally. 

2.2. Free energy. Physicists are particularly concerned with what is called the 
(Helmholtz) free energy, denoted $F$. It is a very important quantity because it
defines the amount of energy available to do work. As we discuss in the next section,
the free energy plays a central role in establishing the relation between thermo- 
dynamics and statistical mechanics, and in particular for deriving the microscopic 
definition of entropy in terms of probabilities. 
The free energy is defined as 

F=U-TS. (2.8) 
This implies that in differential form we have 

dF = dU - TdS - SdT, (2.9) 



which using (2.3) can be written as

dF = -PdV - SdT. (2.10) 

The natural independent variables to describe the free energy of a gas are volume 
and temperature. 

Let us briefly reflect on the meaning of the free energy. Consider a system $A$ in
thermal contact with a heat bath $A'$ kept at a constant temperature $T_0$. Suppose
the system $A$ absorbs heat $\text{đ}Q$ from the reservoir. We may think of the total
system consisting of system plus bath as a closed system: $A^0 = A + A'$. For $A^0$
the second law implies that its entropy can only increase: $dS^0 = dS + dS' \geq 0$. As
the temperature of the heat bath $A'$ is constant and its absorbed heat is $-\text{đ}Q$, we
may write $T_0\,dS' = -\text{đ}Q$. From the first law applied to system $A$ we obtain that
$-\text{đ}Q = -dU - \text{đ}W$, so that we can substitute the expression $T_0\,dS' = -dU - \text{đ}W$
in $T_0\,dS + T_0\,dS' \geq 0$ to get $-dU + T_0\,dS \geq \text{đ}W$. As the system $A$ is kept at a
constant temperature the left hand side is just equal to $-dF$, demonstrating that

$$-dF \geq \text{đ}W. \qquad (2.11)$$

The maximum work that can be done by the system in contact with a heat reservoir
is $-dF$. If we keep the system parameters fixed, i.e. $\text{đ}W = 0$, we obtain that
$dF \leq 0$, showing that for a system coupled to a heat bath the free energy can
only decrease, and consequently in a thermal equilibrium situation the free energy
reaches a minimum. This should be compared with the entropy, which reaches a
maximum at equilibrium. 

We can think of the second law as telling us how different kinds of energy are 
converted into one another: In an isolated system, work can be converted into heat, 
but heat cannot be converted into work. From a microscopic point of view forms of 
energy that are "more organized", such as light, can be converted into those that 
are "less organized" , such as the random motion of particles, but the opposite is 
not possible. 



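From Equation (2.10) the pressure and entropy of a gas can be written as partial
derivatives of the free energy

$$P = -\left(\frac{\partial F}{\partial V}\right)_T, \qquad S = -\left(\frac{\partial F}{\partial T}\right)_V. \qquad (2.12)$$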

So we see that for a system in thermal equilibrium the entropy is a state variable, 
meaning that if we reversibly traverse a closed path we will return to the same value 
(in contrast to other quantities, such as heat, which do not satisfy this property). 
The variables P and S are dependent variables. This is evident from the Maxwell 
relation, obtained by equating the two second derivatives 



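$$\frac{\partial^2 F}{\partial T\,\partial V} = \frac{\partial^2 F}{\partial V\,\partial T}, \qquad (2.13)$$

yielding the relation

$$\left(\frac{\partial P}{\partial T}\right)_V = \left(\frac{\partial S}{\partial V}\right)_T. \qquad (2.14)$$
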
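The Maxwell relation between $P$ and $S$ can be checked on the ideal gas. As a sketch, assume a free energy of the form $F(T, V) = -NkT \ln V + \varphi(T)$, where $\varphi$ is some function of temperature only (this form and the symbol $\varphi$ are our assumptions for the check, not taken from the text). Then

$$P = -\left(\frac{\partial F}{\partial V}\right)_T = \frac{NkT}{V}, \qquad S = -\left(\frac{\partial F}{\partial T}\right)_V = Nk \ln V - \varphi'(T),$$

so the first expression reproduces the ideal gas law, and

$$\left(\frac{\partial P}{\partial T}\right)_V = \frac{Nk}{V} = \left(\frac{\partial S}{\partial V}\right)_T,$$

with both sides agreeing regardless of the form of $\varphi$.
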
3. Statistical mechanics 

In dealing with masses of matter, while we do not perceive the individual 
molecules, we are compelled to adopt what I have described as the statistical 
method of calculation, and to abandon the strict dynamical method, in which 
we follow every motion by the calculus. 

J.C. Maxwell 

We are forced to be contented with the more modest aim of deducing some of 
the more obvious propositions relating to the statistical branch of mechanics. 
Here there can be no mistake in regard to the agreement with the facts of 
nature. 

J.W. Gibbs 

Statistical mechanics is the explanation of the macroscopic behavior of physical 
systems using the underlying microscopic laws of physics, even though the micro- 
scopic states, such as the position and velocity of individual particles, are unknown. 
The key figures in the late 19th century development of statistical mechanics were 
Maxwell, Boltzmann and Gibbs [51, 11, 27]. One of the outstanding questions was
to derive the laws of thermodynamics, in particular to give a microscopic definition 
of the notion of entropy. Another objective was the understanding of phenomena 
that cannot be computed from thermodynamics alone, such as transport phenom- 
ena. For our purpose of highlighting the links with information theory we will give 
a brief and somewhat lopsided introduction. Our main goal is to show the origin of 
the famous expression due to Gibbs for the entropy, $S = -\sum_i p_i \ln p_i$, which was
later used by Shannon to define information. 

3.1. Definitions and postulates.

Considerable semantic confusion has resulted from failure to distinguish be- 
tween prediction and interpretation problems, and attempting a single formal- 
ism to do both. 

E.T. Jaynes

Statistical mechanics considers systems with many degrees of freedom, such as 
atoms in a gas or spins on a lattice. We can think in terms of the microstates of the 
system which are, for example, the positions and velocities of all the particles in a 
vessel with gas. The space of possible microstates is called the phase space. For a 
monatomic gas with $N$ particles, the phase space is $6N$-dimensional, corresponding
to the fact that under Newtonian mechanics there are three positions and three 
velocities that must be measured for each particle in order to determine its future 
evolution. A microstate of the whole system thus corresponds to a single point in 
phase space. 

Statistical mechanics involves the assumption that, even though we know that 
the microstates exist, we are largely ignorant of their actual values. The only 
information we have about them comes from macroscopic quantities, which are bulk 
properties such as the total energy, the temperature, the volume, the pressure, or 
the magnetization. Because of our ignorance we have to treat the microstates in 
statistical terms. But the knowledge of the macroscopic quantities, along with the
laws of physics that the microstates follow, constrains the microstates and allows us
to compute relations between macroscopic variables that might otherwise not be
obvious. Once the values of the macroscopic variables are fixed there is typically
only a subset of microscopic states that are compatible with them, which are called
the accessible states. The number of accessible states is usually huge, but differences 
in this number can be very important. In this chapter we will for simplicity assume 
a discrete set of microstates, but the formalism can be straightforwardly generalized 
to the continuous case. 

The first fundamental assumption of statistical mechanics is that in equilibrium 
a closed system has an equal a priori probability to be in any of its accessible 
states. For systems that are not closed, for example because they are in thermal 
contact or their particle number is not constant, the set of accessible states will be 
different and their probabilities have to be calculated. In either case we associate an 
ensemble of systems with a characteristic probability distribution over the allowed 
microscopic states. Tolman [68] clearly describes the notion of an ensemble:

In using ensembles for statistical purposes, however, it is to be noted that 
there is no need to maintain distinctions between individual systems since 
we shall be interested merely in the number of systems at any time which 
would be found in the different states that correspond to different regions 
of phase space. Moreover, it is also to be noted for statistical purposes 
that we shall wish to use ensembles containing a large enough population 
of separate members so that the number of systems in such different 
states can be regarded as changing continuously as we pass from the 
states lying in one region of the phase space to those in another. Hence, 
for the purpose in view, it is evident that the condition of an ensemble 
at any time can be regarded as appropriately specified by the density $\rho$
with which representative points are distributed over phase space. 

The second postulate of statistical mechanics, called ergodicity, says that time
averages correspond to ensemble averages. That is, on one hand we can take the time
average by following the deterministic motion of all the microscopic variables of
all the particles making up a system. On the other hand, at a given instant in time 
we can take an average over all possible accessible states, weighting them by their 
probability of occurrence. The ergodic hypothesis says that these two averages are 
the same. We return to the restricted validity of this hypothesis in the section on 
nonlinear dynamics. 

3.2. Counting microstates for a system of magnetic spins. In the following 
example we show how it is possible to derive the distribution of microscopic states 
through the assumption of equipartition and simple counting arguments. This also
illustrates that the distribution over microstates becomes extremely narrow in the
thermodynamic limit (i.e. $N \to \infty$). Consider a system of $N$ magnetic spins that
can only take two values $s_j = \pm 1$, corresponding to whether the spin is pointing
up or down (often called Ising spins). The total number of possible configurations
equals $2^N$. For convenience assume $N$ is even, and that the spins do not interact.
Now put these spins in an upward pointing magnetic field $H$ and ask how many
configurations of spins are consistent with each possible value of the energy. The
energy of each spin is $\epsilon_j = \mp\mu H$, and because they do not interact, the total energy
of the system is just the sum of the energies of each spin. For a configuration with
$k$ spins pointing up and $N - k$ spins pointing down the total energy can be written
as $\epsilon_m = 2m\mu H$ with $m = (N - 2k)/2$ and $-N/2 \leq m \leq N/2$. The value of $\epsilon_m$ is
bounded: $-N\mu H \leq \epsilon_m \leq N\mu H$, and the difference between two adjacent energy
levels, corresponding to the flipping of one spin, is $\Delta\epsilon = 2\mu H$. The number of
microscopic configurations with energy $\epsilon_m$ equals

$$g(N, m) = g(N, -m) = \frac{N!}{\left(\frac{1}{2}N + m\right)!\,\left(\frac{1}{2}N - m\right)!}. \qquad (3.1)$$

The total number of states is $\sum_m g(N, m) = 2^N$. For a thermodynamic system $N$
is really large, so we can approximate the factorials by the Stirling formula

$$N! \approx \sqrt{2\pi N}\, N^N e^{-N + \cdots}. \qquad (3.2)$$

Some elementary math gives the Gaussian approximation for the binomial distri- 
bution for large N, 

$$g(N, m) \approx 2^N \left(\frac{2}{\pi N}\right)^{1/2} e^{-2m^2/N}. \qquad (3.3)$$

We will return to this system later on, but at this point we merely want to show that 
for large N the distribution is sharply peaked. Roughly speaking the width of the 
distribution grows with $\sqrt{N}$ while the peak height grows as $2^N$, so the degeneracy of
the states around $m = 0$ increases very rapidly. For example $g(50, 0) = 1.264 \times 10^{14}$,
but for $N \approx N_A$ one has $g(N_A, 0) \approx 10^{10^{23}}$. We will return to this example in
the following section to calculate the magnetization of a spin system in thermal 
equilibrium. 
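
The counting argument is easy to check numerically. The sketch below compares the exact binomial count with the Gaussian approximation for $N = 50$, and estimates the exponent of $g(N_A, 0)$ using logarithms, since $2^N$ itself is far too large to represent for $N$ of order $10^{23}$. The function names are ours, and `math.comb` requires Python 3.8 or later.

```python
import math

def multiplicity(n_spins, m):
    """Exact g(N, m): number of ways to choose which N/2 + m spins share a direction."""
    return math.comb(n_spins, n_spins // 2 + m)

def gaussian_multiplicity(n_spins, m):
    """Gaussian approximation: 2^N * sqrt(2 / (pi N)) * exp(-2 m^2 / N)."""
    prefactor = 2.0 ** n_spins * math.sqrt(2.0 / (math.pi * n_spins))
    return prefactor * math.exp(-2.0 * m * m / n_spins)

def log10_peak_multiplicity(n_spins):
    """log10 of the Gaussian approximation at m = 0, safe for astronomically large N."""
    return n_spins * math.log10(2.0) + 0.5 * math.log10(2.0 / (math.pi * n_spins))

n = 50
print("exact g(50, 0)    =", multiplicity(n, 0))
print("Gaussian g(50, 0) =", gaussian_multiplicity(n, 0))
print("sum over m        =", sum(multiplicity(n, m) for m in range(-n // 2, n // 2 + 1)))
print("log10 g(N_A, 0)   =", log10_peak_multiplicity(6.022e23))
```

Already at $N = 50$ the Gaussian form agrees with the exact count to better than one percent at the peak, and for $N$ of the order of Avogadro's number the exponent itself is of order $10^{23}$.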

3.3. The Maxwell-Boltzmann-Gibbs distribution. Maxwell was the first to 
derive an expression for the probability distribution $p_i$ for a system in thermal
equilibrium, i.e. in thermal contact with a heat reservoir kept at a fixed temperature 
T. This result was later generalized by Boltzmann and Gibbs. An equilibrium 
distribution function of an ideal gas without external force applied to it should not
depend on either position or time, and thus can only depend on the velocities of 
the individual particles. In general there are interactions between the particles that 
need to be taken into account. A simplifying assumption that is well justified by 
probabilistic calculations is that processes in which two particles interact at once 
are much more common than those in which three or more particles interact. If we 
assume that the velocities of two particles are independent before they interact we 
can write their joint probability to have velocities $v_1$ and $v_2$ as a product of the
probability for each particle alone. This implies $p(v_1, v_2) = p(v_1)p(v_2)$. The same
holds after they interact: $p(v_1', v_2') = p(v_1')p(v_2')$. In equilibrium, where nothing
can depend on time, the probability has to be the same afterward, i.e.
$p(v_1, v_2) = p(v_1', v_2')$. How do we connect these conditions before and after the interaction? A
crucial observation is that there are conserved quantities that are preserved during 
the interaction and the equilibrium distribution function can therefore only depend 
on those. Homogeneity and isotropy of the distribution function selects the total 
energy of the particles as the only function on which the distribution depends. The 
conservation of energy in this situation boils down to the simple statement that 
\mv\ + \mv\ = \mv' 2 + \mv' 2 2 . From these relations Maxwell derived the well 
known thermal equilibrium velocity distribution, 

$$f_0(v) = n \left(\frac{m}{2\pi kT}\right)^{3/2} e^{-mv^2/2kT}. \qquad (3.4)$$

The distribution is Gaussian. As we saw, to derive it Maxwell had to make a
number of assumptions which were plausible even though they couldn't be derived
from the fundamental laws of physics. Boltzmann generalized the result to include
the effect of an external conservative force, leading to the replacement of the kinetic
energy in (3.4) by the total conserved energy, which includes potential as well as
kinetic energy.
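One way to see why the conditions above single out an exponential of the energy is
sketched below. This is a compressed version of the standard argument under the
independence and isotropy assumptions stated in the text, not a reconstruction of
Maxwell's original derivation; the constants a and b are introduced here only for
illustration.

```latex
\begin{align*}
p(v_1)\,p(v_2) = p(v_1')\,p(v_2')
  \quad&\Longrightarrow\quad
  \ln p(v_1) + \ln p(v_2) = \ln p(v_1') + \ln p(v_2'), \\
\text{so } \ln p \text{ is additive and conserved, hence linear in the energy:}
  \quad& \ln p(v) = a - b\,\tfrac{1}{2} m v^2, \\
p(v) \propto e^{-b\,m v^2/2},
  \quad& b = \frac{1}{kT} \ \text{ by comparison with (3.4).}
\end{align*}
```

The constant a is fixed by normalization, and the identification of b with the
inverse temperature is what connects the purely probabilistic argument to
thermodynamics.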

Boltzmann's generalization of Maxwell's result makes it clear that the probability
distribution $p_i$ for a general system in thermal equilibrium is given by

$$p_i = \frac{e^{-\epsilon_i/T}}{Z}. \qquad (3.5)$$

Z is a normalization factor that ensures the conservation of probability, i.e.
$\sum_i p_i = 1$. This implies that

$$Z = \sum_i e^{-\epsilon_i/T}. \qquad (3.6)$$

Z is called the partition function. The Boltzmann distribution describes the
canonical ensemble, that is, it applies to any situation where a system is in thermal
equilibrium and exchanging energy with its environment. This is in contrast to
the microcanonical ensemble, which applies to isolated systems where the energy
is constant, or the grand canonical ensemble, which applies to systems that are
exchanging both energy and particles with their environment.$^3$

3 Gibbs extended the Boltzmann result to situations where the number of particles is not
fixed, leading to the introduction of the chemical potential. Because of its complicated history,
the exponential distribution is referred to by a variety of names, including Gibbs, Boltzmann,
Boltzmann-Maxwell, and Boltzmann-Gibbs.

To illustrate the power of the Boltzmann distribution let us briefly return to the
example of the thermal distribution of Ising spins on a lattice in an external magnetic
field. As we pointed out in section 3.2, the energy of a single spin is $\pm\mu H$.
According to the Boltzmann distribution, the probabilities of spin up or spin down are

$$p_\pm = \frac{e^{\mp\mu H/T}}{Z}. \qquad (3.7)$$

The spin antiparallel to the field has lowest energy and therefore is favored. This
leads to an average field dependent magnetization $m_H$ (per spin)

$$m_H = \langle\mu\rangle = \frac{\mu p_+ + (-\mu) p_-}{p_+ + p_-} = -\mu \tanh\frac{\mu H}{T}. \qquad (3.8)$$

This example shows how statistical mechanics can be used to establish relations
between macroscopic variables that cannot be obtained using thermodynamics alone.
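The two-state spin calculation can be reproduced in a few lines. The sketch below
is illustrative: it assumes units with k = 1, and it assumes the convention that the
state labeled "up" has energy $+\mu H$ and moment $+\mu$ while the state labeled "down"
has energy $-\mu H$ and moment $-\mu$, so that the average moment is
$-\mu \tanh(\mu H/T)$. The function names are hypothetical.

```python
import math

def boltzmann_probabilities(energies, temperature):
    """Return p_i = exp(-e_i / T) / Z for a list of energies (units with k = 1)."""
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)  # partition function Z
    return [w / z for w in weights]

def magnetization(mu, field, temperature):
    """Average moment of one spin with energies +mu*H (up) and -mu*H (down).

    Assumption: the 'up' state carries moment +mu and the 'down' state -mu.
    """
    p_up, p_down = boltzmann_probabilities([mu * field, -mu * field], temperature)
    return mu * p_up + (-mu) * p_down

if __name__ == "__main__":
    mu, field = 1.0, 0.5
    for temperature in (0.1, 1.0, 10.0):
        m = magnetization(mu, field, temperature)
        # Second column: closed form -mu * tanh(mu H / T) for comparison.
        print(temperature, m, -mu * math.tanh(mu * field / temperature))
```

At low temperature the magnitude of the magnetization saturates at $\mu$, while at
high temperature it falls off as $\mu^2 H/T$.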

3.4. Free energy revisited. In our discussion of thermodynamics in section 2.2 we
introduced the concept of the free energy F defined by equation (2.8) and argued that
it plays a central role for systems in thermal contact with a heat bath, i.e. systems
kept at a fixed temperature T. In the previous section we introduced the concept
of the partition function Z defined by equation (3.6). Because all thermodynamic
quantities can be calculated from it, the importance of the partition function Z
goes well beyond its role as a normalization factor. The free energy is of particular
importance, because its functional form leads directly to the definition of entropy in
terms of probabilities. We can now directly link the thermodynamical quantities to
the ones defined in statistical mechanics. This is done by postulating the relation
between the free energy and the partition function as$^4$

$$F = -T \ln Z, \qquad (3.9)$$

or alternatively $Z = e^{-F/T}$. From this definition it is possible to calculate all
thermodynamical quantities, for example using equations (2.12). We will now derive
the expression for the entropy in statistical mechanics in terms of probabilities.



3.5. Gibbs entropy. The definition of the free energy in equation (2.8) implies
that

$$S = \frac{U - F}{T}. \qquad (3.10)$$

From (3.9) and (3.5) it follows that

$$F = \epsilon_i + T \ln p_i. \qquad (3.11)$$

Note that even though both the terms on the right depend on i, the free energy F
is independent of i. The equilibrium value for the internal energy is by definition

$$U = \langle\epsilon\rangle = \sum_i \epsilon_i p_i. \qquad (3.12)$$



4 Once we have identified a certain macroscopic quantity like the free energy with a microscopic
expression, then of course the rest follows. Which expression is taken as the starting point for the
identification is quite arbitrary. The justification is a posteriori in the sense that the well known
thermodynamical relations should be recovered.

5 Boltzmann's constant k relates energy to temperature. Its value in conventional units is
$1.4 \times 10^{-23}$ joule/kelvin, but we have set it equal to unity, which amounts to choosing a
convenient unit for energy or temperature.

With these expressions for S, F and U, and making use of the fact that F is
independent of i and $\sum_i p_i = 1$, we can rewrite the entropy in terms of the
probabilities $p_i$: multiplying (3.11) by $p_i$ and summing over i gives
$F = U + T \sum_i p_i \ln p_i$, and substituting this into (3.10) we arrive at the famous
expression for the entropy:

$$S = -\sum_i p_i \ln p_i. \qquad (3.13)$$

This expression is usually called the Gibbs entropy.$^6$
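The equivalence between the thermodynamic expression $S = (U - F)/T$ and the
probabilistic expression $-\sum_i p_i \ln p_i$ can be checked numerically for any set of
energy levels. The sketch below does this for an arbitrary, purely illustrative
four-level system in units with k = 1; the function names and energy values are
assumptions made for the example.

```python
import math

def gibbs_entropy(probabilities):
    """S = -sum_i p_i ln p_i, skipping zero-probability states."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

def thermodynamic_entropy(energies, temperature):
    """Return (S, p) with S = (U - F) / T, F = -T ln Z and U = sum_i e_i p_i."""
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)  # partition function
    probabilities = [w / z for w in weights]
    free_energy = -temperature * math.log(z)
    internal_energy = sum(e * p for e, p in zip(energies, probabilities))
    return (internal_energy - free_energy) / temperature, probabilities

if __name__ == "__main__":
    energies = [0.0, 0.3, 1.0, 2.5]  # arbitrary illustrative energy levels
    s_thermo, p = thermodynamic_entropy(energies, temperature=0.7)
    # The two numbers printed should agree up to rounding error.
    print(s_thermo, gibbs_entropy(p))
```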

In the special case where the total energy is fixed, the w different (accessible)
states all have equal a priori probability $p_i = p = 1/w$. Substitution in the Gibbs
formula yields the expression in terms of the number of accessible states, originally
due to Boltzmann (and engraved on his tombstone):

$$S = \ln w. \qquad (3.14)$$

We emphasize that the entropy grows logarithmically with the number of accessible
states.$^7$ Consider a system consisting of a single particle that can be in one of
two states. Assuming equipartition the entropy is $S_1 = \ln 2$. For a system with
Avogadro's number of particles, $N \sim 10^{23}$, there are $2^N$ states, and if we assume
independence the entropy is $S_N = \ln 2^N = N S_1$, a very large number. The tendency
of a system to maximize its entropy is a probabilistic statement: The number of
states with half of the particles in one state and half in the other is enormously
larger than the number in which all the particles are in the same state, and when
the system is left free it will relax to the most probable accessible state. The
state of a gas particle depends not only on its allowed position (i.e. the volume
of the vessel), but also on its allowed range of velocities: If the vessel is hot that
range is larger than when the vessel is cold. So for an ideal gas one finds that the
entropy increases with the logarithm of the temperature. The fact that the law is
probabilistic implies that it is not completely impossible that the system will return
to a highly improbable initial state. Poincaré showed that it is bound to happen
and gave an estimate of the recurrence time (which for a macroscopic system is
much larger than the lifetime of the universe).

The Gibbs entropy transcends its origins in statistical mechanics. It can be used
to describe any system with states $\{\psi_i\}$ and a given probability distribution $\{p_i\}$.
Credit for realizing this is usually given to Shannon [60], although antecedents
include Szilard, Nyquist and Hartley. Shannon proposed that by analogy to the
entropy S, information can be defined as

$$H = -\sum_i p_i \log_2 p_i. \qquad (3.15)$$

In information theory it is common to take logarithms in base two and drop the
Boltzmann constant.$^8$ Base two is a natural choice of units when dealing with
binary numbers and the units of entropy in this case are called bits; in contrast,
when using the natural logarithm the units are called nats, with the conversion
that 1 nat = 1.443 bits.

6 In quantum theory this expression is replaced by $S = -\mathrm{Tr}\,\rho \ln \rho$ where $\rho$ is the
density matrix of the system.

7 These numbers can be overwhelmingly large. Imagine two macrostates of a system which differ
by 1 millicalorie at room temperature. The difference in entropy is
$\Delta S = -\Delta Q/T = 10^{-3}/293 \approx 10^{-5}$. Thus the ratio of the number of accessible states is
$w_2/w_1 = \exp(\Delta S/k) \approx \exp(10^{18})$, a big number!

8 In our convention k = 1, so H = S/ln 2.

For example a memory consisting of 5 bits (which is the same as a system of 5 Ising
spins) has $N = 2^5$ states. Without further restrictions all of these states (messages)
have equal probability, i.e. $p_i = 1/N$, so that the information content is
$H = -N \frac{1}{N} \log_2 \frac{1}{N} = \log_2 2^5 = 5$ bits. Similarly consider
a DNA molecule with 10 billion base pairs, each of which can be in one of four
combinations (A-T, C-G, T-A, G-C). The molecule can a priori be in any of $4^{10^{10}}$
configurations, so the naive information content (assuming independence) is
$H = 2 \times 10^{10}$ bits. The logarithmic nature of the definition is unavoidable if one wants
the additive property of information under the addition of bits. If in the previous
spin example we add another string of 3 bits then the total number of states is
$N = N_1 N_2 = 2^5 \times 2^3 = 2^8$, from which it also follows that $H = H_1 + H_2 = 8$. If
we add extra ab initio correlations or extra constraints we reduce the number of
independent configurations and consequently H will be smaller.
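The bit-counting examples above are easy to reproduce. The following sketch
computes the Shannon entropy in bits for uniform distributions, checks additivity
for two independent bit strings, and converts between nats and bits. The helper
names are illustrative and not taken from the text.

```python
import math

def shannon_entropy_bits(probabilities):
    """H = -sum_i p_i log2 p_i in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)

def nats_to_bits(entropy_nats):
    """Convert an entropy from nats to bits (1 nat = 1 / ln 2 bits)."""
    return entropy_nats / math.log(2.0)

if __name__ == "__main__":
    five_bits = [1.0 / 32] * 32   # uniform over 2^5 states
    three_bits = [1.0 / 8] * 8    # uniform over 2^3 states
    # Joint distribution of two independent strings: probabilities multiply.
    joint = [p * q for p in five_bits for q in three_bits]
    print(shannon_entropy_bits(five_bits))
    print(shannon_entropy_bits(joint))
    print(nats_to_bits(1.0))
    # DNA example: log2(4) = 2 bits per base pair, 10^10 independent base pairs.
    print(math.log2(4) * 10**10)
```

Because the joint probabilities of independent strings multiply, their logarithms
add, which is exactly the additivity property the logarithmic definition is designed
to deliver.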

As we will discuss in Section 5, this quantitative definition of information and its
applications transcend the limited origin and scope of conventional thermodynamics
and statistical mechanics, as well as Shannon's original purpose of describing
properties of communication channels. See also [13].

4. Nonlinear dynamics 

The present state of the system of nature is evidently a consequence of 
what it was in the preceding moment, and if we conceive of an intelligence 
which at a given instant comprehends all the relations of the entities of 
this universe, it could state the respective position, motions, and general 
effects of all these entities at any time in the past or future. 

Pierre Simon de Laplace (1776) 

A very small cause which escapes our notice determines a considerable 
effect that we cannot fail to see, and then we say that the effect is due 
to chance. 

Henri Poincare (1903). 

From a naive point of view statistical mechanics seems to contradict the
determinism of Newtonian mechanics. For any initial state x(0) (a vector of positions
and velocities) Newton's laws define a dynamical system $\phi^t$ (a set of differential
equations) that maps x(0) into its future states $x(t) = \phi^t(x(0))$. This is completely
deterministic. As Laplace so famously asserted, if mechanical objects obey Newton's
laws, why do we need to discuss perfect certainties in statistical terms?
Laplace partially answered his own question:

. . . But ignorance of the different causes involved in the production of 
events, as well as their complexity, taken together with the imperfection 
of analysis, prevent our reaching the same certainty [as in astronomy] 
about the vast majority of phenomena. Thus there are things that are 
uncertain for us, things more or less probable, and we seek to compen- 
sate for the impossibility of knowing them by determining their different 
degrees of likelihood. So it is that we owe to the weakness of the human 
mind one of the most delicate and ingenious of mathematical theories, 
the science of chance or probability. 

Laplace clearly understood the need for statistical descriptions, but at that point
in time was not fully aware of the consequences of nonlinear dynamics. As Poincaré
later showed, even without human uncertainty (or quantum mechanics), when Newton's
laws give rise to differential equations with chaotic dynamics, we inevitably
arrive at a probabilistic description of nature. Although Poincaré discovered this
in the course of studying the three body problem in celestial mechanics, the answer
he found turns out to have relevance for the reconciliation of the deterministic
Laplacian universe with statistical mechanics.

4.1. The ergodic hypothesis. As we mentioned in the previous section, one of the 
key foundations in Boltzmann's formulation of statistical mechanics is the ergodic 
hypothesis. Roughly speaking, it is the hypothesis that a given trajectory will 
eventually find its way through all the accessible microstates of the system, e.g. all 
those that are compatible with conservation of energy. At equilibrium the average 
length of time that a trajectory spends in a given region of the state space is 
proportional to the number of accessible states the region contains. If the ergodic 
hypothesis is true, then time averages equal ensemble averages, and equipartition 
is a valid assumption. 

The ergodic hypothesis proved to be highly controversial for good reason: It is 
generally not true. One of the first numerical experiments performed on a computer
took place in the early 1950s at Los Alamos, when Fermi, Pasta, and Ulam set out to
test the ergodic hypothesis. They simulated a system of masses connected by nonlinear
springs. They perturbed one of the masses, expecting that the disturbance would 
rapidly spread to all the other masses and equilibrate, so that after a long time 
they would find all the masses shaking more or less randomly. Instead they were 
quite surprised to discover that the disturbance remained well defined - although 
it propagated through the system, it kept its identity, and after a relatively short 
period of time the system returned very close to its initial state. They had in fact 
rediscovered a phenomenon that has come to be called a soliton, a localized but 
very stable travelling disturbance. There are many examples of nonlinear systems 
that support solitons. Such systems do not have equal probability to be in all their 
accessible states, and so are not ergodic. 

Despite these problems, there are many examples where we know that statistical 
mechanics works extremely well. There are even a few cases, such as the hard sphere 
gas, where the ergodic hypothesis can actually be proved. But more typically this 
is not the case. The evidence for statistical mechanics is largely empirical: we know 
that it works, at least to a very high degree of approximation. Subsequent work 
has made it clear that the typical situation is much more complicated than was 
originally imagined. While some trajectories may wander in more or less random 
fashion around much of the accessible phase space, they are blocked from entering 
certain regions by what are called KAM (Kolmogorov-Arnold-Moser) tori. Other
initial conditions yield trajectories that make regular motions and lie on KAM tori.
The KAM tori are separated from each other, and have a lower dimension than the
full accessible phase space. Such KAM tori correspond to situations in which there
are other conservation laws in addition to the conservation of energy, which may
depend on initial conditions as well as other parameters.$^9$ Solitons
are examples of this in which the solutions can be interpreted as a geometrically
isolated pulse.

9 Dynamical systems that conserve energy and obey Newton's laws have special properties that
cause the existence of KAM tori. Dissipative systems typically have attractors, subsets of the
state space that orbits converge onto. Energy conserving systems do not have attractors, and
often have chaotic orbits tightly interwoven with regular orbits.

There have now been an enormous number of studies of ergodicity in nonlinear
dynamics. While there are no formal theorems that definitively resolve this, the
accumulated lore from these studies suggests that for nonlinear systems that do
not have hidden symmetries, as the number of interacting components increases
and nonlinearities become stronger, the generic behavior is that chaotic behavior
becomes more and more likely: the KAM tori shrink, fewer and fewer initial
conditions are trapped on them, and the regions they exclude become smaller.
The ergodic hypothesis becomes an increasingly better approximation, a typical
single trajectory can reach almost all accessible states, and equipartition becomes
a good assumption. The problems occur in understanding when there are hidden
symmetries that can support phenomena like solitons. Determining the necessary and
sufficient conditions for ergodicity to be a good assumption remains an active field
of research.

4.2. Chaos and limits to prediction. The discovery of chaos makes it clear
that Boltzmann's use of probability is even more justified than he realized. When
motion is chaotic, two infinitesimally nearby trajectories separate at an exponential
rate [50]. This is a geometric property of the underlying nonlinear
dynamics. From a linear point of view the dynamics are locally unstable. To make
this precise, consider two N dimensional initial conditions x(0) and x'(0) that are
initially separated by an infinitesimal vector $\delta x(0) = x(0) - x'(0)$. Provided the
dynamical system is differentiable, the separation will grow as

$$\delta x(t) = D\phi^t(x(0))\, \delta x(0), \qquad (4.1)$$

where $D\phi^t(x(0))$ is the derivative of the dynamical system $\phi^t$ evaluated at the
initial condition x(0). For any fixed time t and initial condition x(0), $D\phi^t$ is just
an N x N matrix, and this is just a linear equation. If the motion is chaotic the
length of the separation vector $\delta x$ will grow exponentially with t in at least one
direction, as shown in Figure 3. The figure shows how the divergence of nearby
trajectories is the underlying reason chaos leads to unpredictability. A perfect
measurement would correspond to a point in the state space, but any real measurement
is inaccurate, generating a cloud of uncertainty. The true state might be anywhere
inside the cloud. As shown here for the Lorenz equations (a simple system of three
coupled nonlinear differential equations [50]), the uncertainty of the initial
measurement is represented by 10,000 red dots, initially so close together that they
are indistinguishable; a single trajectory is shown for reference in light blue. As
each point moves under the action of the equations, the cloud is stretched into a
long, thin thread, which then folds over onto itself many times, until the points are
mixed more or less randomly over the entire attractor. Prediction has now become
impossible: the final state can be anywhere on the attractor. For a regular motion,
in contrast, all the final states remain close together. We can think about this in
information theoretic terms; for a chaotic motion information is initially lost at a
linear rate which eventually results in all the information being lost - for a regular
motion the information loss is relatively small. The numbers above the illustration
are in units of 1/200 of the natural time units of the Lorenz equations. (From [14].)
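The exponential separation of nearby trajectories can be observed directly by
integrating the Lorenz equations from two almost identical initial conditions. The
sketch below is a minimal illustration, not the computation behind the figure: it
assumes the standard parameter values (10, 28, 8/3), a hand-written fourth-order
Runge-Kutta integrator, and an arbitrary starting point and step size.

```python
import math

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (standard parameter values)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4_step(f, state, dt):
    """One fourth-order Runge-Kutta step for an autonomous system."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2))
    k3 = f(shifted(state, k2, dt / 2))
    k4 = f(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def separation_history(delta0=1e-8, dt=0.01, steps=4000):
    """Distance between two trajectories that start delta0 apart."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + delta0, 1.0, 1.0)
    history = []
    for _ in range(steps):
        a = rk4_step(lorenz, a, dt)
        b = rk4_step(lorenz, b, dt)
        history.append(math.dist(a, b))
    return history

if __name__ == "__main__":
    hist = separation_history()
    for step in (0, 1000, 2000, 3000, 3999):
        print(f"t = {(step + 1) * 0.01:6.2f}   separation = {hist[step]:.3e}")
```

The separation grows by many orders of magnitude and then saturates at the size of
the attractor, which is the point at which the initial measurement has lost all
predictive value.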

Figure 3. The divergence of nearby trajectories for the Lorenz
equations. See the text for an explanation.

Nonetheless, at the same time the motion can be globally stable, meaning that
it remains contained inside a finite volume in the phase space. This is achieved
by stretching and folding - the nonlinear dynamics knead the phase space through
local stretching and global folding, just like a baker making a loaf of bread. Two
trajectories that are initially nearby may later be quite far apart, and still later,
may be close together again. This property is called mixing. More formally, the
dynamics are mixing over a given set $\Sigma$, with invariant measure$^{10}$ $\mu$ with support
$\Sigma$, if for any subsets A and B

$$\lim_{t \to \infty} \mu(\phi^t B \cap A) = \mu(A)\mu(B). \qquad (4.2)$$

Intuitively, this just means that B is smeared throughout $\Sigma$ by the flow, so that
the probability of finding a point originating in B inside of A is just the original
probability of B, weighted by the probability of A. Geometrically, this happens
if and only if the future trajectory of B is finely "mixed" throughout $\Sigma$ by the
stretching and folding action of $\phi^t$.
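The mixing condition can be illustrated with a standard textbook example that is
not discussed in the text: Arnold's cat map on the unit torus, which preserves area
and is mixing with respect to it. The sketch below estimates the measure of the part
of the evolved set B that lies in A by Monte Carlo sampling; the choice of map, of
the sets A and B, and of the sample size are all assumptions made for illustration.

```python
import random

def cat_map(x, y):
    """Arnold's cat map on the unit torus, an area-preserving mixing map."""
    return (2.0 * x + y) % 1.0, (x + y) % 1.0

def overlap_measure(steps, samples=40000, seed=0):
    """Monte Carlo estimate of mu(phi^t B intersect A) for the area measure mu.

    B is the square [0, 1/2) x [0, 1/2) and A is the strip [0, 1/2) x [0, 1).
    """
    rng = random.Random(seed)
    mu_b = 0.25
    hits = 0
    for _ in range(samples):
        x, y = 0.5 * rng.random(), 0.5 * rng.random()  # uniform point in B
        for _ in range(steps):
            x, y = cat_map(x, y)
        if x < 0.5:  # the image point lies in A
            hits += 1
    return mu_b * hits / samples

if __name__ == "__main__":
    for t in (0, 1, 2, 4, 8):
        print(t, round(overlap_measure(t), 4))
    print("mu(A) * mu(B) =", 0.5 * 0.25)
```

At t = 0 the set B lies entirely inside A, so the overlap equals the measure of B;
after a handful of iterations the estimate settles near the product of the two
measures, as the mixing condition requires.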

Mixing implies ergodicity, so any dynamical system that is mixing over $\Sigma$ will
also be ergodic on $\Sigma$. It only satisfies the ergodic hypothesis, however, if $\Sigma$ is the
set of accessible states. This need not be the case. Thus, the fact that a system has
orbits with chaotic dynamics doesn't mean that it necessarily satisfies the ergodic
hypothesis - there may still be subsets of finite volume in the phase space that
are stuck making regular motions, for example on KAM tori.

Nonetheless, chaotic dynamics has strong implications for statistical mechanics.
If a dynamical system is ergodic but not mixing,$^{11}$ it is in principle possible to
make detailed long range predictions by measuring the position and velocity of all
its microstates, as suggested by Laplace. In contrast, if
it is mixing then even if we know the initial values of the microstates at a high (but
finite) level of precision, all this information is asymptotically lost, and statistical
mechanics is unavoidable.$^{12}$
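A minimal example of a system that is ergodic but not mixing is a rigid rotation
of the circle by an irrational angle. The sketch below, which uses the golden-ratio
rotation number as an illustrative choice, shows both properties: time averages
converge to the measure of a set, yet an initial measurement error does not grow,
so long range prediction remains possible.

```python
import math

GOLDEN = (math.sqrt(5.0) - 1.0) / 2.0  # irrational rotation number (illustrative)

def rotate(x, steps=1):
    """Rigid rotation of the circle [0, 1): x -> x + steps * GOLDEN (mod 1)."""
    return (x + steps * GOLDEN) % 1.0

def time_average_in_arc(x0, arc_length, steps):
    """Fraction of time the orbit of x0 spends in the arc [0, arc_length)."""
    hits = sum(1 for n in range(steps) if rotate(x0, n) < arc_length)
    return hits / steps

def circle_distance(x, y):
    """Shortest distance between two points on the unit circle."""
    d = abs(x - y) % 1.0
    return min(d, 1.0 - d)

if __name__ == "__main__":
    # Ergodic: the time average approaches the length (measure) of the arc.
    print(time_average_in_arc(0.1, 0.25, 10000))
    # Not mixing: an initial error of 1e-3 is unchanged after 10000 steps.
    print(circle_distance(rotate(0.1, 10000), rotate(0.1 + 1e-3, 10000)))
```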

10 A measure $\mu$ is invariant over a set $\Sigma$ with respect to the dynamics $\phi^t$ if it satisfies
the condition $\mu(A) = \mu(\phi^{-t}(A))$, where A is any subset of $\Sigma$. There can be many invariant
measures, but the one that we have in mind throughout is the one corresponding to time averages.

11 A simple example of a system that is ergodic but not mixing is a dynamical system whose
solution is the sum of two sinusoids with irrationally related frequencies.

12 An exception is that some systems display phase invariance even while they are chaotic.
The orbits move around an attractor, being chaotically scrambled transverse to their direction of
motion but keeping their timing for completing a circuit of the attractor [23].

4.3. Quantifying predictability. Information theory can be used to quantify
predictability [61]. To begin the discussion, consider a measuring instrument with a
uniform scale of resolution $\epsilon$. For a ruler, for example, $\epsilon$ is the distance between
adjacent graduations. If such a measuring instrument is assigned to each of the N
real variables in a dynamical system, the graduations of these instruments induce
a partition $\Pi$ of the phase space, which is a set of non-overlapping N dimensional
cubes, labeled $C_i$, which we will call the outcomes of the measurement. A
measurement determines that the state of the system is in a given cube $C_i$. If we let
transients die out, and restrict our attention to asymptotic motions without external
perturbations, let us assume the motion is confined to a set $\Sigma$ (which in general
depends on the initial condition). We can then compute the asymptotic probability
of a given measurement by measuring its frequency of occurrence $p_i$, and if the
motion is ergodic on $\Sigma$, then we know that there exists an invariant measure $\mu$
such that $p_i = \mu(C_i)$. To someone who knows the invariant measure $\mu$ but knows
nothing else about the state of the system, the average information that will be
gained in making a measurement is just the entropy

$$I(\epsilon) = -\sum_i p_i \log p_i. \qquad (4.3)$$

We are following Shannon in calling this "information" since it represents the
element of surprise in making the measurement. The information is written $I(\epsilon)$ to
emphasize its dependence on the scale of resolution of the measurements. This can
be used to define a dimension for $\mu$. This is just the asymptotic rate of increase of
the information with resolution, i.e.

$$D = \lim_{\epsilon \to 0} \frac{I(\epsilon)}{|\log \epsilon|}. \qquad (4.4)$$

This is called the information dimension [22]. Note that this reduces to what is
commonly called the fractal dimension when $p_i$ is sufficiently smooth, i.e. when
$-\sum_i p_i \log p_i \approx \log n$, where n is the number of measurement outcomes with nonzero
values of $p_i$.
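To make the definition concrete, the following sketch evaluates the ratio
$I(\epsilon)/|\log \epsilon|$ for a measure on the middle-thirds Cantor set, an example chosen
here for illustration and not taken from the text. At each refinement a box splits
into two occupied sub-boxes of one third the size, receiving fractions
`left_weight` and `1 - left_weight` of its probability (a hypothetical parameter).
With equal weights the result is the fractal dimension log 2 / log 3; with unequal
weights the information dimension is smaller.

```python
import math

def cantor_box_probabilities(level, left_weight=0.5):
    """Probabilities of the 2**level occupied boxes of size 3**-level."""
    probabilities = [1.0]
    for _ in range(level):
        probabilities = [
            p * w for p in probabilities for w in (left_weight, 1.0 - left_weight)
        ]
    return probabilities

def information(probabilities):
    """I(eps) = -sum_i p_i log p_i (natural logarithm)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

def information_dimension_estimate(level, left_weight=0.5):
    """Ratio I(eps) / |log eps| at resolution eps = 3**-level."""
    eps = 3.0 ** -level
    boxes = cantor_box_probabilities(level, left_weight)
    return information(boxes) / abs(math.log(eps))

if __name__ == "__main__":
    for level in (2, 6, 10):
        print(level,
              information_dimension_estimate(level),
              information_dimension_estimate(level, left_weight=0.3))
    print("log 2 / log 3 =", math.log(2.0) / math.log(3.0))
```

For this self-similar construction the ratio is already independent of the level,
so no extrapolation to small $\epsilon$ is needed; for measured data one would instead fit
the slope of $I(\epsilon)$ against $|\log \epsilon|$.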

This notion of dimension can be generalized by using the Rényi entropy $R_\alpha$,

$$R_\alpha = \frac{1}{1 - \alpha} \log \sum_i p_i^\alpha, \qquad (4.5)$$

where $\alpha > 0$ and $\alpha \neq 1$. The value for $\alpha = 1$ is defined by taking the limit as
$\alpha \to 1$, which reduces to the usual Shannon entropy. By replacing the Shannon
entropy by the Rényi entropy it is possible to define a generalized dimension $d_\alpha$.
This contains the information dimension in the special case $\alpha = 1$. This has proved
to be very useful in the study of multifractal phenomena (fractals whose scalings
are irregular). We will say more about the use of such alternative entropies in the
next section.
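The limit $\alpha \to 1$ can be checked numerically. The sketch below implements the
Rényi entropy as written in (4.5) and compares it with the Shannon entropy for an
arbitrary illustrative distribution; the function names are assumptions made for
the example.

```python
import math

def shannon_entropy(probabilities):
    """-sum_i p_i log p_i (natural logarithm)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

def renyi_entropy(probabilities, alpha):
    """R_alpha = log(sum_i p_i**alpha) / (1 - alpha) for alpha > 0, alpha != 1."""
    if alpha <= 0.0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(p ** alpha for p in probabilities if p > 0.0)
    return math.log(total) / (1.0 - alpha)

if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]  # arbitrary illustrative distribution
    for alpha in (0.5, 0.9, 0.999, 1.001, 1.1, 2.0):
        print(alpha, renyi_entropy(p, alpha))
    print("Shannon:", shannon_entropy(p))
```

For a uniform distribution every $R_\alpha$ equals log n, which is why the generalized
dimensions $d_\alpha$ all coincide for a simple fractal and differ only when the measure
is unevenly distributed.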

The discussion so far has concerned the amount of information gained by an 
observer in making a single, isolated measurement, i.e. the information gained in 
taking a "snapshot" of a dynamical system. We can alternatively ask how much 
new information is obtained per unit time by an observer who is watching a movie 
of a dynamical system. In other words, what is the information acquisition rate of 
an experimenter who makes a series of measurements to monitor the behavior of a 
dynamical system? For a regular dynamical system (to be defined more precisely
in a moment) new measurements asymptotically provide no further information in
the limit $t \to \infty$. But if the dynamical system is chaotic, new measurements are
constantly required to update the knowledge of the observer in order to keep the
observer's knowledge of the state of the system at the same resolution.

This can be made more precise as follows. Consider a sequence of m measurements
$(x_1, x_2, \ldots, x_m) = X_m$, where each measurement corresponds to observing
the system in a particular N dimensional cube. Letting $p(X_m)$ be the probability
of observing the sequence $X_m$, the entropy of this sequence of measurements is

$$H_m = -\sum_{X_m} p(X_m) \log p(X_m). \qquad (4.6)$$

We can then define the information acquisition rate as

$$h = \lim_{m \to \infty} \frac{H_m}{m \Delta t}. \qquad (4.7)$$

$\Delta t$ is the sampling interval between measurements. Provided $\Delta t$ is sufficiently
small and other conditions are met, h is equal to the metric entropy, also called the
Kolmogorov-Sinai (KS) entropy.$^{13}$ Note that this is not really an entropy, but an
entropy production rate, which (if logs are taken to base 2) has units of bits/second.
If $h > 0$ the motion is chaotic, and if $h = 0$ it is regular. Thus, when the system is
chaotic, the entropy $H_m$ contained in a sequence of measurements continues to
increase even in the limit as the sequence becomes very long. In contrast, for a
regular motion this reaches a limiting value.
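The contrast between chaotic and regular motion shows up clearly in empirical
block entropies. The sketch below is an illustration under stated assumptions: it
uses the logistic map $x \to 4x(1 - x)$ with a two-cell partition at x = 1/2 as a
stand-in chaotic system, a period-4 symbol sequence as a stand-in regular system,
and one map iteration as the sampling interval, so rates are in bits per iteration.

```python
import math
from collections import Counter

def block_entropy_bits(symbols, m):
    """Entropy H_m (in bits) of the empirical distribution of length-m blocks."""
    counts = Counter(tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def logistic_symbols(n, x0=0.1):
    """Coarse-grained orbit of the chaotic logistic map x -> 4 x (1 - x)."""
    x, out = x0, []
    for _ in range(n):
        out.append(0 if x < 0.5 else 1)
        x = 4.0 * x * (1.0 - x)
    return out

def periodic_symbols(n):
    """A regular (period-4) symbol sequence 0, 0, 1, 1, 0, 0, 1, 1, ..."""
    return [(i // 2) % 2 for i in range(n)]

if __name__ == "__main__":
    chaotic, regular = logistic_symbols(50000), periodic_symbols(50000)
    for m in (1, 2, 4, 8):
        # Columns: block length, H_m / m for the chaotic and the regular sequence.
        print(m, block_entropy_bits(chaotic, m) / m, block_entropy_bits(regular, m) / m)
```

For the chaotic sequence $H_m/m$ stays close to one bit per iteration, while for the
periodic sequence $H_m$ saturates at two bits (the information needed to fix the
phase) and $H_m/m$ decays toward zero.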

Although we have so far couched the discussion in terms of probabilities, the
metric entropy is determined by geometry. The average rates of expansion and
contraction in a trajectory of a dynamical system can be characterized by the
spectrum of Lyapunov exponents. These are defined in terms of the eigenvalues
of $D\phi^t$, the derivative of the dynamical system, as defined in equation (4.1). For a
dynamical system in N dimensions, let the N eigenvalues of the matrix $D\phi^t(x(0))$
be $\alpha_i(t)$. Because $D\phi^t$ is a positive definite matrix, the $\alpha_i$ are all positive. The
Lyapunov exponents are defined as $\lambda_i = \lim_{t \to \infty} \log \alpha_i(t)/t$. To think about this
more geometrically, imagine an infinitesimal ball that has radius $\epsilon(0)$ at time t = 0.
As this ball evolves under the action of the dynamical system it will distort. Since
the ball is infinitesimal, however, it will remain an ellipsoid as it evolves. Let
the principal axes of this ellipsoid have length $\epsilon_i(t)$. The spectrum of Lyapunov
exponents for a given trajectory passing through the initial ball is

$$\lambda_i = \lim_{t \to \infty} \lim_{\epsilon(0) \to 0} \frac{1}{t} \log \frac{\epsilon_i(t)}{\epsilon(0)}. \qquad (4.8)$$

For an N dimensional dynamical system there are N Lyapunov exponents. The
positive Lyapunov exponents $\lambda^+$ measure the rates of exponential divergence, and
the negative ones $\lambda^-$ the rates of convergence. They are related to the metric
entropy by Pesin's theorem

$$h = \sum_i \lambda_i^+. \qquad (4.9)$$

In other words, the metric entropy is the sum of the positive Lyapunov exponents,
and it corresponds to the average exponential rate of expansion in the phase space.
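The geometric picture of an evolving ellipsoid translates directly into a
numerical procedure: carry a set of tangent vectors along the trajectory,
re-orthonormalize them at each step, and average the logarithms of the stretching
factors. The sketch below applies this to the Hénon map with its usual parameter
values, an example chosen here for illustration and not discussed in the text. The
estimate of the metric entropy in the last line assumes that Pesin's relation
applies to this system.

```python
import math

A, B = 1.4, 0.3  # standard Henon parameters (illustrative choice)

def henon(x, y):
    """One step of the Henon map."""
    return 1.0 - A * x * x + y, B * x

def jacobian_times(x, vec):
    """Apply the Jacobian [[-2 A x, 1], [B, 0]] at the point (x, y) to a vector."""
    return (-2.0 * A * x * vec[0] + vec[1], B * vec[0])

def lyapunov_spectrum(steps=20000, transient=1000):
    """Estimate both Lyapunov exponents by evolving two tangent vectors and
    re-orthonormalizing them (Gram-Schmidt) after every step."""
    x, y = 0.0, 0.0
    for _ in range(transient):  # let the orbit settle onto the attractor
        x, y = henon(x, y)
    e1, e2 = (1.0, 0.0), (0.0, 1.0)
    sum1 = sum2 = 0.0
    for _ in range(steps):
        e1, e2 = jacobian_times(x, e1), jacobian_times(x, e2)
        x, y = henon(x, y)
        n1 = math.hypot(*e1)  # stretching along the most expanding direction
        e1 = (e1[0] / n1, e1[1] / n1)
        dot = e1[0] * e2[0] + e1[1] * e2[1]
        e2 = (e2[0] - dot * e1[0], e2[1] - dot * e1[1])
        n2 = math.hypot(*e2)  # stretching along the orthogonal direction
        e2 = (e2[0] / n2, e2[1] / n2)
        sum1 += math.log(n1)
        sum2 += math.log(n2)
    return sum1 / steps, sum2 / steps

if __name__ == "__main__":
    lam1, lam2 = lyapunov_spectrum()
    print("exponents:", lam1, lam2)
    print("sum of positive exponents:", max(lam1, 0.0) + max(lam2, 0.0))
```

A useful consistency check is that the two exponents must sum to the logarithm of
the area contraction per step, which for this map is log 0.3 at every point.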

Taken together the metric entropy and information dimension can be used to 
estimate the length of time that predictions remain valid. The information di- 
mension allows an estimate to be made of the information contained in an initial 
measurement, and the metric entropy estimates the rate at which this information 
decays. 

As we have already seen, for a series of measurements the metric entropy tells 
us the information gained with each measurement. But if each measurement is 
made with the same precision, the information gained must equal the information 
that would have been lost had the measurement not been made. Thus the metric 
entropy also quantifies the initial rate at which knowledge of the state of the system 
is lost after a measurement. 

13 In our discussion of metric entropy we are sweeping many important mathematical formalities
under the rug. For example, to make this definition precise we need to take a supremum over all
partitions and sampling rates. Also, it is not necessary to make the measurements in N dimensions
- there typically exists a one dimensional projection that is sufficient, under an optimal partition.

To make this more precise, let $p_{ij}(t)$ be the probability that a measurement
at time t has outcome j if a measurement at time 0 has outcome i. In other
words, given the state was measured in partition element $C_i$ at time 0, what is the
probability it will be in partition element $C_j$ at time t? By definition $p_{ij}(0) = 1$
if i = j and $p_{ij}(0) = 0$ otherwise. With no initial information, the information
gained from the measurement is determined solely by the asymptotic measure $\mu$,
and is $-\log \mu(C_j)$. In contrast, if $C_i$ is known the information gained on learning
outcome j is $-\log p_{ij}(t)$. The extra information using a prediction from the initial
data is the difference of the two, or $\log(p_{ij}(t)/\mu(C_j))$. This must be averaged over
all possible measurements $C_j$ at time t, and all possible initial measurements $C_i$.
The measurements $C_j$ are weighted by their probability of occurrence $p_{ij}(t)$, and
the initial measurements are weighted by $\mu(C_i)$. This gives

$$I(t) = \sum_{i,j} \mu(C_i)\, p_{ij}(t) \log \frac{p_{ij}(t)}{\mu(C_j)}. \qquad (4.10)$$

It can easily be shown that in the limit where the initial measurements are made
arbitrarily precise, I(t) will initially decay at a linear rate, whose slope is equal to
the metric entropy. For measurements with signal to noise ratio s, i.e. with
$\log s \approx |\log \epsilon|$, $I(0) \approx D_I \log s$, where $D_I$ is the information dimension. Thus I(t)
can be approximated as $I(t) \approx D_I \log s - ht$, and
the initial data becomes useless after a characteristic time $\tau = (D_I/h) \log s$.
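The practical consequence of this estimate is that the prediction horizon grows
only logarithmically with measurement precision. The short sketch below evaluates
the linear approximation for a few signal to noise ratios; the values used for the
information dimension and the metric entropy are hypothetical, chosen only for
illustration, and natural logarithms are assumed throughout.

```python
import math

def information_remaining(t, dimension, entropy_rate, signal_to_noise):
    """Linear approximation I(t) = D_I log s - h t, floored at zero."""
    return max(dimension * math.log(signal_to_noise) - entropy_rate * t, 0.0)

def prediction_horizon(dimension, entropy_rate, signal_to_noise):
    """Characteristic time tau = (D_I / h) log s at which I(t) reaches zero."""
    return dimension / entropy_rate * math.log(signal_to_noise)

if __name__ == "__main__":
    # Hypothetical values, not taken from the text.
    d_info, h = 2.0, 0.9
    for s in (1e2, 1e4, 1e8):
        tau = prediction_horizon(d_info, h, s)
        print(f"s = {s:.0e}   tau = {tau:.2f}")
```

Squaring the signal to noise ratio, which is typically very costly experimentally,
only doubles the time over which predictions remain useful.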

To conclude, chaotic dynamics provides the link that connects deterministic dy- 
namics with probability. While we can discuss chaotic systems in completely de- 
terministic terms, as soon as we address problems of measurement and long-term 
predictability we are forced to think in probabilistic terms. The language we have
developed above, of information dimension, Lyapunov exponents, and metric
entropy, provides the link between the geometric and probabilistic views. Chaotic
dynamics can happen even in a few dimensions, but as we move to high dimensional
systems, e.g. when we discuss the interactions between many particles, probability 
is thrust on us for two reasons: The difficulty of keeping track of all the degrees 
of freedom, and the "increased likelihood" that nonlinear interactions will give rise 
to chaotic dynamics. "Increased likelihood" is in quotations because, despite more 
than a century of effort, understanding the necessary and sufficient conditions for 
the validity of statistical mechanics remains an open problem. 



In this section we will discuss various aspects of entropy, its relation to
information theory and the sometimes confusing connotations of order, disorder,
ignorance and incomplete knowledge. This will be done by treating several well known
puzzles and paradoxes related to the concept of entropy. A derivation of the second
law using the procedure called coarse graining is presented. The extensivity or
additivity of entropy is considered in some detail, also when we discuss nonstandard
extensions of the definition of entropy.

5.1. Entropy and information. The important innovation Shannon made was 
to show that the relevance of the concept of entropy considered as a measure of 
information was not restricted to thermodynamics, but could be used in any context 
where probabilities can be defined. He applied it to problems in communication 
theory and showed that it can be used to compute a bound on the information 
transmission rate using an optimal code. 

One of the most basic results that Shannon obtained was to show that the choice
of the Gibbs form of entropy to describe uncertainty is not arbitrary, even when it is
used in a very general context. Both Shannon and Khinchin [41] proved that if one
wants certain conditions to be met by the entropy function then the functional form
originally proposed by Gibbs is the unique choice. The fundamental conditions as
specified by Khinchin are:

(1) For a given $n$ and $\sum_{i=1}^{n} p_i = 1$, the required function
$H(p_1, \ldots, p_n)$ is maximal for all $p_i = 1/n$.

(2) The function should satisfy $H(p_1, \ldots, p_n, 0) = H(p_1, \ldots, p_n)$. The
inclusion of an impossible event should not change the value of $H$.

(3) If $A$ and $B$ are two finite sets of events, not necessarily independent, the
entropy $H(A, B)$ for the occurrence of joint events $A$ and $B$ is the entropy
for the set $A$ alone plus the weighted average of the conditional entropy
$H(B|A_i)$ for $B$ given the occurrence of the $i$th event $A_i$ in $A$,

$$H(A, B) = H(A) + \sum_i p_i H(B|A_i), \tag{5.1}$$

where event $A_i$ occurs with probability $p_i$.
The important result is that given these conditions the function $H$ given in equation
(3.15) is the unique solution. Shannon's key insight was that the results of
Boltzmann and Gibbs in explaining entropy in terms of statistical mechanics had
unintended and profound side-effects, with a broader and more fundamental meaning
that transcended the physical origin of entropy. The abstract conditions
formulated by Shannon and Khinchin show the very general context in
which the Gibbs-Shannon function is the unique answer. Later on we will pose
the question of whether there are situations where not all three conditions are
appropriate, leading to alternative expressions for the entropy.
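
The three Khinchin conditions can be checked numerically for the Gibbs-Shannon
form $H = -\sum_i p_i \log p_i$. The sketch below is only a spot check on a few
hand-picked distributions, not a proof of uniqueness; the function names and the
joint distribution `JOINT` are illustrative assumptions.

```python
import math

def shannon_entropy(probs):
    """H = -sum p ln p, with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def joint_entropy(joint):
    """Entropy H(A, B) of a joint distribution joint[i][j] = P(A_i, B_j)."""
    return shannon_entropy([p for row in joint for p in row])

def chain_rule_rhs(joint):
    """H(A) + sum_i p_i H(B | A_i), the right-hand side of condition (3)."""
    p_a = [sum(row) for row in joint]
    conditional = sum(
        pa * shannon_entropy([p / pa for p in row])
        for pa, row in zip(p_a, joint) if pa > 0
    )
    return shannon_entropy(p_a) + conditional

# Hypothetical joint distribution of two dependent binary events.
JOINT = [[0.10, 0.30], [0.25, 0.35]]

if __name__ == "__main__":
    skewed = [0.1, 0.2, 0.3, 0.4]
    print("condition 1:", shannon_entropy([0.25] * 4) >= shannon_entropy(skewed))
    print("condition 2:", shannon_entropy(skewed + [0.0]) == shannon_entropy(skewed))
    print("condition 3:", abs(joint_entropy(JOINT) - chain_rule_rhs(JOINT)) < 1e-12)
```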

5.2. The Landauer principle. In discussing the relation between information
and entropy it may be illuminating to return briefly to the Landauer principle
[44, 45], which, as we mentioned in the first section, is a particular formulation of the
second law of thermodynamics well suited for the context of information theory.
The principle expresses the fact that erasure of data in a system necessarily
involves producing heat, and thereby increasing the entropy. We have illustrated the
principle in figure 4. Consider a "gas" consisting of a single atom in a symmetric
container with volume $2V$, in contact with a heat bath. We imagine that the
position of the particle acts as a memory with one bit of information, corresponding to
whether the atom is on the left or on the right. Erasing the information amounts
to resetting the device to the "reference" state 1 independent of the initial state.
Erasure corresponds therefore to reinitializing the system rather than making a
measurement. It can be done by first opening a diaphragm in the middle, then
reversibly moving the piston from the right in, and finally closing the diaphragm
and moving the piston back. In the first step the gas expands freely to the double
volume. The particle doesn't do any work, the energy is conserved, and therefore
no heat will be absorbed from the reservoir. This is an irreversible adiabatic process
by which the entropy $S$ of the gas increases by $k \ln(2V/V) = k \ln 2$. (The
number of states the particle can be in is just the volume; the average velocity is
conserved because of the contact with the thermal bath and will not contribute
to the change in entropy). In the second part of the erasure procedure we bring
the system back to a state which has the same entropy as the initial state. We
do this through a quasistatic (i.e. reversible) isothermal process at temperature $T$.
During the compression the entropy decreases by $k \ln 2$. This change of entropy is
nothing but the amount of heat delivered by the gas to the reservoir divided by the
temperature, i.e. $\Delta S = \int dS = \int dQ/T = \Delta Q/T$. The heat produced $\Delta Q$ equals
the net amount of work $W$ that has been done in the cycle by moving the piston
during the compression. The conclusion is that during the erasure of one bit of
information the device had to produce at least $\Delta Q = kT \ln 2$ of heat.

Figure 4. An illustration of the Landauer principle using a very
simple thermodynamical system.
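
To give a feeling for the size of this bound, the sketch below evaluates
$kT \ln 2$ numerically. The function name `landauer_bound` and the choice of
300 K as "room temperature" are assumptions made for illustration; the value of
the Boltzmann constant is the exact SI value.

```python
import math

BOLTZMANN_K = 1.380649e-23  # J/K, exact value in the SI

def landauer_bound(temperature_kelvin, bits=1.0):
    """Minimum heat in joules produced when erasing `bits` bits at temperature T."""
    return bits * BOLTZMANN_K * temperature_kelvin * math.log(2)

if __name__ == "__main__":
    # Erasing one bit, and erasing one gigabyte (8e9 bits), at 300 K.
    print(landauer_bound(300.0))
    print(landauer_bound(300.0, bits=8e9))
```

At 300 K the bound is about $2.9 \times 10^{-21}$ J per bit, many orders of
magnitude below the energy dissipated per operation in present-day electronics.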

We may look at the same process somewhat more abstractly, purely from the
point of view of information. We map the erasure of information for the simple
memory device on the sequence of diagrams depicted in figure 5. We choose this
representation of the accessible (phase) space to clearly mark the differences
between the situation where the particle is in the left or the right (A), the left and the
right (B), and the left compartment only (C). In part A the memory corresponds to
the particle being either in the left or in the right compartment. In B the partition
has been removed and through the free expansion the phase space has doubled and
consequently the entropy increased by $\ln 2$ (in units where $k = 1$). In C the system
is brought back to the reference state, i.e. the particle is brought into the left
compartment. This is done by moving a piston in from the right, inserting the partition,
and moving the piston out again. It is in the compressing step that the phase space is
reduced by a factor of two and hence entropy is reduced by $\ln 2$. This is possible
because we did work, producing a corresponding amount of heat ($\Delta Q \geq T \ln 2$).
Note that in this representation one can in principle change the sizes of the partitions
along the horizontal directions and the a priori probabilities along the vertical
direction to model different types or aspects of memory devices.

Figure 5. A phase space picture of Landauer's principle, with panels A, B and C.
See text for an explanation.

5.3. The entropy as a relative concept. 

Irreversibility is a consequence of the explicit introduction of ignorance into the 
fundamental laws. 

M. Born

There is a surprising amount of confusion about the interpretation and meaning
of the concept of entropy [31, 17]. One may wonder to what extent the "entropic
principle" is just an "anthropocentric principle". That is, does entropy depend
only on our perception, or is it something more fundamental? Is it a subjective 
attribute in the domain of the observer or is it an intrinsic property of the physical 
system we study? Let us consider the common definition of entropy as a measure 
of disorder. This definition can be confusing unless we are careful in spelling out 
what we mean by order or disorder. We may for instance look at the crystallization 
of a supercooled liquid under conditions where it is a closed system, i.e. when no 
energy is exchanged with the environment. Initially the molecules of the liquid 
are free to randomly move about, but then (often through the addition of a small 
perturbation that breaks the symmetry) the liquid suddenly turns into a solid by 
forming a crystal in which the molecules are pinned to the sites of a regular lattice. 
From one point of view this is a splendid example of the creation of order out of
chaos. Yet from standard calculations in statistical mechanics we know that the 
entropy increases during crystallization. This is because what meets the eye is only 
part of the story. During crystallization entropy is generated in the form of latent 
heat, which is stored in the vibrational modes of the molecules in the lattice. Thus, 
even though in the crystal the individual molecules are constrained to be roughly 
in a particular location, they vibrate around their lattice sites more energetically 
than when they were free to wander. From a microscopic point of view there are 
more accessible states in the crystal than there were in the liquid, and thus the 
entropy increases. The thermodynamic entropy is indifferent to whether motions 
are microscopic or macroscopic - it only counts the number of accessible states and 
their probabilities. 

In contrast, to measure the sense in which the crystal is more orderly, we must 
measure a different set of probabilities. To do this we need to define probabilities 
that depend only on the positions of the particles and not on their velocities. To 
make this even more clear-cut, we can also use a more macroscopic partition, large 
enough so that the thermal motions of a molecule around its lattice site tend to 
stay within the same partition element. The entropy associated with this set of 
probabilities, which we might call the "spatial order entropy", will behave quite
differently from the thermodynamic entropy. For the liquid, when every particle 
is free to move anywhere in the container, the spatial order entropy will be high, 
essentially at its largest possible value. After the crystallization occurs, in contrast, 
the spatial order entropy will drop dramatically. Of course, this is not the ther- 
modynamic entropy, but rather an entropy that we have designed to quantitatively 
capture the aspect of the crystalline order that we intuitively perceive. 
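
A minimal sketch of such a "spatial order entropy" is given below. It is a toy
model under several assumptions that are not in the text: a one-dimensional box
of unit length, a single tagged particle, a partition into `n_cells` equal cells,
a "liquid" modeled as uniformly random positions, and a "crystal" modeled as
small bounded excursions around one lattice site. All names are illustrative.

```python
import math
import random

def cell_entropy(positions, n_cells):
    """Shannon entropy (in nats) of the occupation frequencies of the cells."""
    counts = [0] * n_cells
    for x in positions:
        counts[min(int(x * n_cells), n_cells - 1)] += 1
    total = len(positions)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def sample_liquid(n_samples, rng):
    """Tagged particle free to be anywhere in the box [0, 1)."""
    return [rng.random() for _ in range(n_samples)]

def sample_crystal(n_samples, rng, site=0.55, amplitude=0.02):
    """Tagged particle vibrating around a lattice site, staying within one cell."""
    return [site + rng.uniform(-amplitude, amplitude) for _ in range(n_samples)]

rng = random.Random(0)
liquid_entropy = cell_entropy(sample_liquid(10000, rng), n_cells=10)
crystal_entropy = cell_entropy(sample_crystal(10000, rng), n_cells=10)

if __name__ == "__main__":
    print("liquid :", liquid_entropy, "(maximum is", math.log(10), ")")
    print("crystal:", crystal_entropy)
```

Because the cells are chosen larger than the vibration amplitude, the crystal
value vanishes, while the liquid value is close to its maximum $\ln 10$. Neither
number says anything about the thermodynamic entropy, which also counts the
velocities.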

As we emphasized before, Shannon's great insight was that it is possible to 
associate an entropy with any set of probabilities. However, the example just given 
illustrates that when we use entropy in the broader sense of Shannon we must 
be very careful to specify the context of the problem. Shannon entropy is just a 
function that reduces a set of probabilities to a number, reflecting how many nonzero 
possibilities there are as well as the extent to which the set of nonzero probabilities 
is uniform or concentrated. Within a fixed context, a set of probabilities that is
smaller and more concentrated can be interpreted as more "orderly", in the sense
that fewer numbers are needed to specify the set of possibilities. Thermodynamics
dictates a particular context - we have to measure probabilities in the full state
space. Thermodynamic entropy is a special case of Shannon entropy. In the more
general context of Shannon, in contrast, we can define probabilities however we
want, depending on what we want to do. But to avoid confusion we must always 
be careful to keep this context in mind, so that we know what our computation 
means. 



5.4. Maxwell's demon. 

The "being" soon came to be called Maxwell's demon, because of its far- 
reaching subversive effects on the natural order of things. Chief among these 
effects would be to abolish the need for energy sources such as oil, uranium and 
sunlight. 

C.H. Bennett 

The second law of thermodynamics is statistical, deriving from the fact that the 
individual motions of the molecules are not observed or controlled in any way. 
Would things be different if we could intervene on a molecular scale? This question 
gives rise to an important paradox posed by Maxwell in 1872, which appeared in
his Theory of Heat [51]. This has subsequently been discussed by generations of
physicists, notably Szilard [67], Brillouin [13], Landauer [44], Bennett [8] and others.

Maxwell described his demonic setup as follows: "Let us suppose that a vessel 
is divided in two portions, A and B, by a division in which there is a small hole, 
and that a being who can see individual molecules opens and closes this hole, so 
as to allow only the swifter particles to pass from A to B, and only the slower
ones to pass from B to A. He will thus, without expenditure of work, raise the 
temperature of B and lower that of A, in contradiction with the second law of 
thermodynamics." In attempts to save the second law from this demise, many
ingredients have been proposed for its resolution, including Brownian
motion, quantum uncertainty and even Gödel's Theorem. The resolution of the
paradox touches on some very fundamental issues that center on the question of 
how the demon might actually realize his subversive interventions. 

Szilard clarified the discussion by introducing an engine (or thermodynamic
cycle), which is depicted in the left half of figure 6. He and Brillouin focused on the
measurement the demon has to perform in order to find out in which half of the
vessel the particle is located after the partition has been put into place. For the
demon to "see" the actual molecules he has to use a measurement device, such as
a source of light (photons) and a photon detector. He will in principle be able to
measure whether a molecule is faster or slower than the thermal average by
scattering a photon off of it. Brillouin tried to argue that the entropy increase to the
whole system once the measurement is included would always be larger than or equal
to the entropy gain achieved by the subsequent actions of the demon. However,
this argument didn't hold; people were able to invent devices that got around the
measurement problem, so that it appeared the demon could beat the second law.

Instead, the resolution of the paradox comes from a very different source. In
1982 Bennett gave a completely different argument to rescue the second law. The
fundamental problem is that under Landauer's principle, production of heat is
necessary for erasure of information (see section 5.2). Bennett showed that a reversible
measurement could in principle be made, so that Brillouin's original argument was
wrong - measurement does not necessarily produce any entropy. However, to truly
complete the thermodynamic cycle, the demon has to erase the information he
obtained about the location of the gas molecule. As we already discussed in
section 5.2, erasing that information produces entropy. It turns out that the work that
has to be done to erase the demon's memory is at least as much as was originally
gained.

Figure 6. The one-particle Maxwell demon apparatus as envisaged
by Bennett [8, 9]. An explanation is given in the text.

Figure 6 illustrates the one-particle Maxwell demon apparatus as envisaged by
Bennett [8, 9], which is a generalization of the engine proposed by Szilard [67].
On the left in row (A) is a gas container containing one molecule with a partition
and two pistons. On the right is a schematic representation of the phase space
of the system, including the demon. The demon's mind can be in
three different states: He can know the molecule is on the right (state 0), on the
left (state 1), or he can be in the reference or blank state r, where he lacks any
information, and knows that he doesn't know where the particle is. In the schematic
diagram of the phase space, shown on the right, the vertical direction indicates the
state of memory of the demon and the horizontal direction indicates the position
of the particle. In step (B) a thin partition is placed in the container, trapping the
particle in either the left or right half. In step (C) the demon makes a (reversible)
measurement to determine the location of the particle. This alters his state of mind
as indicated - if the particle is on the right, it goes into state 0, if on the left, into
state 1. In step (D), depending on the outcome of the measurement, he moves either
the right or left piston in and removes the partition. In (E) the gas expands,
moving the piston out and thereby doing work. In state (E) it appears as if the
system has returned to its original state - it has the same volume, temperature and
entropy - yet work has been done. What's missing? The problem is that in (E) the
demon's mind has not returned to its original blank state. He needs to know that
he doesn't know the position of the particle. Setting the demon's memory back into
its original state requires erasing a bit of information. This is evident in the fact
that to go from (E) to (F) the occupied portion of the phase space is reduced by a
factor of two. This reduction in entropy has to be accompanied by production of
heat as a consequence of Landauer's principle (see figure 4 and figure 5) - the work
that is done to erase a bit of information is greater than or equal to the work gained
by the demon. This ensures that the full cycle of the complete system respects the
second law after all.
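
The bookkeeping of the occupied phase space through the cycle can be sketched
by enumerating the joint states of particle and demon. The discretization
below, with two particle positions and three memory states, and the function
names `measure`, `expand` and `erase` are simplifying assumptions made for
illustration; they are not Bennett's construction in detail.

```python
import math

POSITIONS = ("L", "R")  # particle in the left or right half

def measure(states):
    """Step (C): the memory becomes correlated with the particle position."""
    return {(pos, "1" if pos == "L" else "0") for pos, _mem in states}

def expand(states):
    """Steps (D)-(E): the particle again explores both halves; memory is kept."""
    return {(pos, mem) for _pos, mem in states for pos in POSITIONS}

def erase(states):
    """Step (F): the memory is reset to the blank reference state r."""
    return {(pos, "r") for pos, _mem in states}

def entropy_bits(states):
    """Log of the number of occupied joint states, in bits."""
    return math.log2(len(states))

step_b = {(pos, "r") for pos in POSITIONS}  # blank memory, particle on either side
step_c = measure(step_b)
step_e = expand(step_c)
step_f = erase(step_e)

if __name__ == "__main__":
    for name, states in (("B", step_b), ("C", step_c), ("E", step_e), ("F", step_f)):
        print(name, sorted(states), entropy_bits(states))
```

In this toy count the measurement leaves the number of occupied states
unchanged, the expansion doubles it, and the erasure halves it again, which is
the factor of two that Landauer's principle charges for.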

This resolution of the paradox is remarkable, because it is not the acquisition of 
information (the measurement) which is irreversible and thermodynamically costly, 
but it is the process of erasure, which is both logically and thermodynamically 
irreversible, that leads to the increase of entropy required by the second law. The 
information comes for free, but it poses a waste disposal problem which is costly. It 
is gratifying to see information theory come to the rescue of one of the most cherished
physical laws. 

5.5. The Gibbs paradox. The Gibbs paradox provides another interesting chap- 
ter in the debate on the meaning of entropy. The basic question is to what extent 
entropy is a subjective notion. In its simplest form the paradox concerns the mixing 
of two ideal gases (kept at the same temperature and pressure) after removing a
partition. Once it has been removed the gases will mix, and if the particles of the two
gases are distinguishable the entropy will increase due to this mixing. However, if 
the gases are identical, so that their particles are indistinguishable from those on 
the other side, there is no increase in the entropy. Maxwell imagined the situation 
where the gases were initially supposed to be identical, and only later recognized 
to be different. This reasoning led to the painful conclusion that the notion of ir- 
reversibility and entropy would depend on our knowledge of physics. He concluded 
that the entropy would thus depend on the state of mind of the experimenter and 
therefore lacked an objective ground. Once again it was Maxwell who, with a simple
question, created an uncomfortable situation that led to a long debate. After
the development of quantum mechanics, it became clear that particles of the same 
species are truly indistinguishable. There is no such thing as labeling N individual 
electrons, and therefore interchanging electrons doesn't change the state and this 
fact reduces the number of states by a relative factor of N!. Therefore the conclu- 
sion is that the entropy does not increase when the gases have the same constituent 
particles, and it does increase when they are different. 
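
The role of the factor $N!$ can be made concrete with a small calculation. The
sketch below uses only the configurational part of the ideal gas entropy, in
units of $k$, namely $\ln(V^N)$ without and $\ln(V^N/N!)$ with the
indistinguishability correction. The function names and this reduction to the
volume-dependent part are assumptions made for illustration.

```python
import math

def config_entropy(n, volume, indistinguishable=True):
    """Configurational entropy in units of k: ln(V^N), minus ln(N!) if requested."""
    s = n * math.log(volume)
    if indistinguishable:
        s -= math.lgamma(n + 1)  # lgamma(n + 1) = ln(n!)
    return s

def mixing_entropy(n, volume, same_species, indistinguishable=True):
    """Entropy change when two gases of n particles each, in volume V each, mix."""
    before = 2 * config_entropy(n, volume, indistinguishable)
    if same_species:
        after = config_entropy(2 * n, 2 * volume, indistinguishable)
    else:
        after = 2 * config_entropy(n, 2 * volume, indistinguishable)
    return after - before

if __name__ == "__main__":
    n = 10**6
    print("different gases     :", mixing_entropy(n, 1.0, same_species=False))
    print("same gas, with N!   :", mixing_entropy(n, 1.0, same_species=True))
    print("same gas, without N!:", mixing_entropy(n, 1.0, True, indistinguishable=False))
```

For different gases the result is $2N \ln 2$. For identical gases the $N!$
correction removes this extensive term and leaves only a residue of order
$\ln N$, which is negligible per particle; without the correction the spurious
$2N \ln 2$ reappears.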

However, the resolution of the Gibbs paradox does not really depend on quantum
mechanics. Jaynes has emphasized that in the early works of Gibbs, the correct
argument was already given (well before the advent of quantum mechanics) [55].
Gibbs made an operational definition, saying that if "identical" means anything, it
means that there is no way an "unmixing" apparatus could determine whether a
particular molecule came from a given side of the box, short of having followed its
entire trajectory. Thus if the particles of the gas are identical in this sense, the
entropy will not change. We conclude that the adequate definition of entropy reflects
the objective physical constraints we put on the system, i.e. what measurements 
are possible or admissible. This has nothing to do with our lack of knowledge 
but rather with our choices. The 'incompleteness of our knowledge' is an exact 
and objective reflection of a particular set of macroscopic constraints imposed on 
the physical system we want to describe. The system's behavior depends on these 
constraints, and so does the entropy. 

5.6. The maximal entropy principle of Jaynes. 

The statistical practice of physicists has tended to lag about 20 years behind 
current developments in the field of basic probability and statistics. 

E.T. Jaynes (1963) 

There are two equivalent sets of postulates that can be used as a foundation to
derive an equilibrium distribution in statistical mechanics. One is to begin with the
hypothesis that equilibrium corresponds to a minimum of the free energy, and the
other is that it corresponds to a maximum of the entropy. The latter approach is a
relatively modern development. Inspired by Shannon, Jaynes turned the program
of statistical mechanics upside down [38]. Starting from a very general set of axioms
he showed that under the assumption of equilibrium the Gibbs expression for the
entropy is unique. Under Jaynes' approach, any problem in equilibrium statistical
mechanics is reduced to finding the set of $p_i$ for which the entropy is maximal, under
a set of constraints that specify the macroscopic conditions, which may come from
theory or may come directly from observational data [37]. This variational approach
removes some of the arbitrariness that was previously present in the foundations
of statistical mechanics. The principle of maximum entropy is very simple and has
broad application. For example if one maximizes $S$ only under the normalization
condition $\sum_i p_i = 1$, then one finds the unique solution that $p_i = 1/N$ with $N$
the total number of states. This is the uniform probability distribution underlying
the equipartition principle. Similarly, if we now add the constraint that energy is
conserved, i.e. $\sum_i E_i p_i = U$, then the unique solution is given by the Boltzmann
distribution, equation (3.5). The maximum entropy principle as a starting point
clearly separates the physical input and purely probabilistic arguments that enter
the theory. Let us derive the Maxwell-Boltzmann distribution to illustrate the
maximal entropy principle. We start with the function $L(p_i, \alpha, \beta)$ which depends on
the probability distribution and two Lagrange multipliers to impose the constraints:

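$$L(p_i, \alpha, \beta) = -\sum_{i=1}^{N} p_i \ln p_i - \alpha\left(\sum_{i=1}^{N} p_i - 1\right) - \beta\left(\sum_{i=1}^{N} p_i E_i - U\right). \tag{5.2}$$

The maximum is determined by setting the partial derivatives of $L$ equal to zero:

$$\frac{\partial L}{\partial p_i} = -\ln p_i - 1 - \alpha - \beta E_i = 0, \tag{5.3}$$

$$\frac{\partial L}{\partial \alpha} = -\left(\sum_{i=1}^{N} p_i - 1\right) = 0, \tag{5.4}$$

$$\frac{\partial L}{\partial \beta} = -\left(\sum_{i=1}^{N} p_i E_i - U\right) = 0. \tag{5.5}$$

From the first equation we immediately obtain that:

$$p_i = e^{-(1+\alpha+\beta E_i)} \;\Rightarrow\; p_i = \gamma e^{-\beta E_i}. \tag{5.6}$$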

The parameters $\gamma$ and $\beta$ are determined by the constraint equations. If we first
substitute the above solution in the normalisation constraint, and then use the
defining equation for the partition sum (3.6), we find that $\gamma = 1/Z$. The solution
for $\beta$ is most easily obtained using the following argument. First substitute (5.6)
in the definition (3.13) of $S$ to obtain the relation:

$$S = \beta U - \text{const.} \tag{5.7}$$

Next we use the thermodynamic relation between energy and entropy (2.3), from
which we obtain that $dU/dS = T$. Combining these two relations we find that
$\beta = 1/T$, which yields the thermal equilibrium distribution (3.5).
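
The derivation can be checked numerically on a small example. The sketch below
assumes a hypothetical three-level system with energies 0, 1 and 2 and a
prescribed mean energy $U = 0.6$ (in units where $k = 1$). It solves the energy
constraint for $\beta$ by bisection and then verifies that moving away from the
Boltzmann distribution, in a direction that preserves both constraints, lowers
the entropy. All names are illustrative.

```python
import math

def boltzmann(energies, beta):
    """p_i = exp(-beta E_i) / Z."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def mean_energy(energies, probs):
    return sum(e * p for e, p in zip(energies, probs))

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def solve_beta(energies, target_u, lo=-50.0, hi=50.0):
    """Bisection on beta; the mean energy decreases monotonically with beta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, boltzmann(energies, mid)) > target_u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

ENERGIES = [0.0, 1.0, 2.0]  # hypothetical level scheme
U = 0.6                     # hypothetical mean energy

beta = solve_beta(ENERGIES, U)
p_star = boltzmann(ENERGIES, beta)

# The direction (1, -2, 1) changes neither the normalization nor the mean energy.
DIRECTION = [1.0, -2.0, 1.0]
perturbed = [[p + eps * d for p, d in zip(p_star, DIRECTION)] for eps in (-0.01, 0.01)]

if __name__ == "__main__":
    print("beta =", beta, "p =", p_star, "S =", entropy(p_star))
    for q in perturbed:
        print("perturbed S =", entropy(q))
```

The same numbers also satisfy $S = \beta U + \ln Z$, which is relation (5.7)
with the constant written out explicitly.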

The maximal entropy formalism has a much wider validity than just statisti- 
cal mechanics. It is widely used for statistical inference in applications such as 
optimizing data transfer and statistical image improvement. In these contexts it 
provides a clean answer to the question, "given the constraints I know about in the
problem, what is the model that is as random as possible (i.e. minimally biased)
subject to these constraints?". A common application is missing data: Suppose one
observes a series of points $x_i$ at regular time intervals, but some of the observations
are missing. One can make a good guess for the missing values by solving for the
distribution that maximizes the entropy, subject to the constraints imposed by the
known data points.

One must always bear in mind, however, that in physics the maximum entropy
principle only applies to equilibrium situations, which are only a small subset of
the problems in physics. For systems that are not in equilibrium one must take a
different approach. Attempts to understand non-equilibrium statistical mechanics
have led some researchers to explore the use of alternative notions of entropy, as
discussed in Section 5.11.

5.7. Ockham's razor. 

Entia non sunt multiplicanda praeter necessitatem

(Entities should not be introduced except when strictly necessary)

William of Ockham (1285-1347)

An interesting and important application of information is to the process of 
modeling itself. When developing a model it is always necessary to make a tradeoff 
between models that are too simple and fail to explain the data properly, and models 
that are too complicated and fit fluctuations in the data that are really just noise. 
The desirability of simpler models is often called "Ockham's razor" : If two models 
fit the existing data equally well, the simplest model is preferable, in the sense that 
the simpler model is more likely to make good predictions for data that has not yet 
been seen. While the value of using simple models seems like something we can all 
agree on, the tradeoff in real problems is typically not so obvious. Suppose model 
A fits the data a little better than model B, but has one more parameter. How 
does one trade off goodness of fit against number of parameters? 

Using ideas from information theory Akaike [2] introduced a method for making
tradeoffs between goodness of fit and model complexity that can be applied in the
context of simple linear models. Rissanen subsequently introduced a more general
framework to think about this problem based on a principle that he called minimum
description length (MDL) [58, 29, 30]. The basic idea is that the ability to make
predictions and the ability to compress information are essentially two sides of 
the same coin. We can only compress data if it contains regularities, i.e. if the 
structure of the data is at least partially predictable. We can therefore find a good 
prediction model by seeking the model that gives the shortest description of the 
data we already have. When we do this we have to take the description length of 
the model into account, as well as the description length of the deviations between 
the model's predictions and the actual data. The deviations between the model and 
the data can be treated as probabilistic events. A model that gives a better fit has 
less deviation from the data, and hence implies a tighter probability distribution, 
which translates into a lower entropy for the deviations from the data. This entropy 
is then added to the information needed to specify the model and its parameters. 
The best model is the one with the lowest sum, i.e. the smallest total description 
length. By characterizing the goodness of fit in terms of bits, this approach puts the 
complexity of the model and the goodness of fit on the same footing, and gives the 
correct tradeoff between goodness of fit and model complexity, so that the quality 
of any two models can be compared, at least in principle. 
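
A deliberately simple instance of this tradeoff is sketched below: deciding
whether a sequence of coin flips is better described by a fair coin (no
parameters) or by a biased coin (one fitted parameter). The cost of
$\frac{1}{2}\log_2 n$ bits for the fitted parameter is a common asymptotic
approximation that is assumed here; it is a crude stand-in for, not a
reproduction of, the full MDL construction. All function names are
illustrative.

```python
import math

def binary_entropy_bits(p):
    """Entropy in bits of a coin with heads probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def description_length_fair(n_heads, n_total):
    """Fair coin: no parameters to state, one bit per outcome."""
    return float(n_total)

def description_length_biased(n_heads, n_total):
    """Biased coin: bits for the data given the fitted p, plus bits for p itself."""
    p_hat = n_heads / n_total
    data_bits = n_total * binary_entropy_bits(p_hat)
    model_bits = 0.5 * math.log2(n_total)  # assumed cost of one parameter
    return data_bits + model_bits

def preferred_model(n_heads, n_total):
    fair = description_length_fair(n_heads, n_total)
    biased = description_length_biased(n_heads, n_total)
    return "fair" if fair <= biased else "biased"

if __name__ == "__main__":
    for heads in (50, 55, 65, 80):
        print(heads, description_length_biased(heads, 100), preferred_model(heads, 100))
```

With 50 heads out of 100 the extra parameter buys nothing and the fair coin
gives the shorter description; with 80 heads out of 100 the better fit more
than pays for the parameter.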

This shows how at some level the concept of entropy underlies the whole scientific 
method, and indeed, our ability to make sense out of the world. To describe the 
patterns in the world, we need to make a trade-off between overfitting (fitting every 
bump even if it is a random variation, i.e. fitting noise) and overgeneralization 
(identifying events that really are different). A similar trade-off arises in assigning 
a causal mechanism to the occurrence of an event or explaining it as random. This 
problem of how to exactly make such trade-offs based on time series analysis has 
a rather long history but on the other hand is still an active topic of research 
[H E3 ES] . Even if we do not do these trade-offs perfectly and do not think about 
it quantitatively, when we discover and model regularities in the world, we are 
implicitly relying on a model selection process of this type. Any generalization 
makes a judgment that trades off the information needed to specify the model and 
the entropy of the fit of the model to the world. 



5.8. Coarse graining and irreversibility. 

Our aim is not to 'explain irreversibility' but to describe and predict the ob- 
servable facts. If one succeeds in doing this correctly, from first principles, we 
will find that philosophical questions about the 'nature of irreversibility' will 
either have been answered automatically or else will be seen as ill considered 
and irrelevant. 

E.T. Jaynes 

The second law of thermodynamics says that for a closed system the entropy will 
increase until it reaches its equilibrium value. This corresponds to the irreversibility 
we all know from daily experience. If we put a drop of ink in a glass of water the 
drop will diffuse through the water and dilute until the ink is uniformly spread 
through the water. The increase of entropy is evident in the fact that the ink is
initially in a small region, with $p_i = 0$ except for this region, leading to a probability
distribution concentrated on a small region of space and hence a low entropy. The
system will not return to its original configuration. Although this is not impossible
in principle, it is so improbable that it will never be observed (see the footnote on
recurrence times).

Irreversibility is hard to understand from the microscopic point of view because 
the microscopic laws of nature that determine the time evolution of any physical 
system on the fundamental level are all symmetric under time reversal. That is, 
the microscopic equations of physics, such as $F = ma$, are unchanged under the
substitution $t \to -t$. How can irreversibility arise on the macroscopic level if it has
no counterpart on the microscopic level? 

In fact, if we compute the entropy at a completely microscopic level it is con- 
served, which seems to violate the second law of thermodynamics. This follows from 
the fact that momentum is conserved, which implies that volumes in phase space 
are conserved. This is called Liouville's theorem. It is easy to prove that this
implies that the entropy $S$ is conserved. This doesn't depend on the use of continuous
variables - it only depends on applying the laws of physics at the microscopic level. 
It reflects the idea of Laplace, which can be interpreted as a statement that sta- 
tistical mechanics wouldn't really be necessary if we could only measure and track 
all the little details. The ingenious argument that Gibbs used to clarify this, and 
thereby to reconcile statistical mechanics with the second law of thermodynamics, 
was to introduce the notion of coarse graining. This procedure corresponds to a 
systematic description of what we could call "zooming out". As we have already
mentioned, this zooming out involves dividing phase space up in finite regions $\delta$
according to a partition $\Pi$. Suppose, for example, that at a microscopic level the
system can be described by discrete probabilities $p_i$ for each state. Let us start
with a closed system in equilibrium, with a uniform distribution over the accessible 
states. For the Ising system, for example, pi = l/g{N,%) is the probability of a 
particular configuration of spins. Now we replace in each little region 8 the values 
of pi by its average value pi over 8: 

$$\bar{p}_i = \frac{1}{\delta} \sum_{j \in \delta} p_j , \tag{5.8}$$

and consider the associated coarse grained entropy

$$\bar{S} = -\sum_i \bar{p}_i \ln \bar{p}_i . \tag{5.9}$$

Because we start at time $t = 0$ with a uniform probability distribution, $\bar{S}(0) = S(0)$.
Next we change the situation by removing a constraint of the system so that it is
no longer in equilibrium. In other words, we enlarge the space of accessible states
but have as an initial condition that the probabilities are zero for the new states.
For the new situation we still have that $\bar{S}(0) = S(0)$, and now we can compare the



Footnote: "Never say never" is a saying of unchallenged wisdom. What we mean here by "never" is
inconceivably stronger than "never in a lifetime", or even "never in the lifetime of the universe".
Let's make a rough estimate: consider a dilute inert gas, say helium, that fills the left half of a
container of volume $V$. Then we release the gas into the full container and ask what the recurrence
time would be, i.e. how long it would take before all particles would be in the left half again.
A simple argument giving a reasonable estimate would be as follows: At any given instant the
probability for a given particle to be in the left half is $1/2$, but since the particles are independent,
the probability of $N \sim N_A$ particles to be in the left half is $P = (1/2)^{10^{23}} \approx 10^{-10^{20}}$. Assuming
a typical time scale for completely rearranging all the particles in the container of, say, $t_0 = 10^{-3}$
seconds, the typical time that will pass before such a fluctuation occurs is
$T = t_0/P = 10^{10^{20}} \times 10^{-3} \approx 10^{10^{20}}$ sec.

evolution of the fine-grained entropy $S(t)$ and the coarse-grained entropy $\bar{S}(t)$. The
evolution of $S(t)$ is governed by the reversible microscopic dynamics and therefore
it stays constant, so that $S(t) = S(0)$. To study the evolution of the coarse-grained
entropy we can use a few simple mathematical tricks. First, note that because $\bar{p}_i$
is constant over each region with $\delta$ elements,

$$\sum_i p_i \ln \bar{p}_i = \sum_i \bar{p}_i \ln \bar{p}_i . \tag{5.10}$$

Then we may write

$$\bar{S}(t) - \bar{S}(0) = \sum_i p_i (\ln p_i - \ln \bar{p}_i) = \sum_i p_i \ln \frac{p_i}{\bar{p}_i} = \sum_i \bar{p}_i \left( \frac{p_i}{\bar{p}_i} \ln \frac{p_i}{\bar{p}_i} \right) , \tag{5.11}$$

which in information theory is called the Kullback-Leibler divergence. The mathematical
inequality $x \ln x \ge (x - 1)$, with $x = p_i/\bar{p}_i$, then implies the Gibbs inequality:

$$\bar{S}(t) - \bar{S}(0) \ge \sum_i \bar{p}_i \left( \frac{p_i}{\bar{p}_i} - 1 \right) = \sum_i p_i - \sum_i \bar{p}_i = 1 - 1 = 0 . \tag{5.12}$$

Equality only occurs if $p_i/\bar{p}_i = 1$ throughout, so except for the special case where
this is true, this is a strict inequality and the entropy increases. We see how the
second law is obtained as a consequence of coarse graining.
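
The coarse-graining argument can be checked numerically. The following is a minimal sketch, not part of the original derivation: the eight-state distribution `p` and the block size `delta = 2` are illustrative assumptions, and the function names are our own. It averages the probabilities over each block and confirms that the coarse-grained entropy is never smaller than the fine-grained one, with the gap equal to the Kullback-Leibler divergence between the fine-grained and averaged distributions.

```python
import math

def entropy(p):
    """Gibbs/Shannon entropy -sum p ln p; zero-probability states contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def coarse_grain(p, delta):
    """Replace each p_i by its average over its block of `delta` consecutive states."""
    assert len(p) % delta == 0, "number of states must be a multiple of delta"
    averaged = []
    for start in range(0, len(p), delta):
        block = p[start:start + delta]
        averaged.extend([sum(block) / delta] * delta)
    return averaged

def kl_divergence(p, p_bar):
    """Kullback-Leibler divergence sum p ln(p / p_bar)."""
    return sum(x * math.log(x / y) for x, y in zip(p, p_bar) if x > 0)

# Hypothetical fine-grained distribution over 8 microstates (sums to 1).
p = [0.40, 0.10, 0.20, 0.05, 0.05, 0.05, 0.10, 0.05]
p_bar = coarse_grain(p, delta=2)

S_fine = entropy(p)
S_coarse = entropy(p_bar)
gap = S_coarse - S_fine

print("fine-grained entropy  :", S_fine)
print("coarse-grained entropy:", S_coarse)
print("gap                   :", gap)
print("KL(p || p_bar)        :", kl_divergence(p, p_bar))
```

The gap vanishes only when `p` is already constant on every block, which is the equality case of the Gibbs inequality.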

The second law describes mathematically the irreversibility we witness when 
somebody blows smoke in the air. Suppose we make a film of the developing smoke 
cloud. If we film the movie at an enormous magnification, so that what we see are 
individual particles whizzing back and forth, it will be impossible to tell which way 
the movie is running - from a statistical point of view it will look the same whether 
we run the movie forward or backward. But if we film it at a normal macroscopic 
scale of resolution, the answer is immediately obvious: the direction of increasing
time is clear from the diffusion of the smoke from a well-defined thin stream to a 
diffuse cloud. 

From a philosophical point of view one should ask to what extent coarse graining 
introduces an element of subjectivity into the theory. One could object that the 
way we should coarse grain is not decided upon by the physics but rather by the 
person who performs the calculation. The key point is that, as in so many other 
situations in physics, we have to use some common sense, and distinguish between 
observable and unobservable quantities. Entropy does not increase in the highly 
idealized classical world that Laplace envisioned, as long as we can observe all the 
microscopic degrees of freedom and there are no chaotic dynamics. However, as soon 
as we violate these conditions and observe the world at a finite level of resolution 
(no matter how accurate), chaotic dynamics ensures that we will lose information 
and entropy will increase. While the coarse graining may be subjective, this is not
surprising: measurements are inherently subjective operations. In most systems
one will have that the entropy may stabilize on plateaus corresponding to certain
ranges of the coarse-graining scale. In many applications the increase of entropy
will therefore be constant (i.e. well defined) for a sensible choice for the scale
of coarse graining. The increase in (equilibrium) entropy between the microscopic 
scale and the macroscopic scale can also be seen as the amount of information that 
is lost by increasing the graining scale from the microscopic to the macroscopic. A 
relevant remark at this point is that a system is of course never perfectly closed - 
there are always small perturbations from the environment that act as a stochastic 
perturbation of the system, thereby continuously smearing out the actual distribution
in phase space and simulating the effect of coarse graining. Coarse graining
correctly captures the fact that entropy is a measure of our uncertainty; the fact 
that this uncertainty does not exist for regular motions and perfect measurements 
is not relevant to most physical problems. 

5.9. Coarse graining and renormalization. In a written natural language not 
all finite combinations of letters are words, not all finite combinations of words are 
sentences, and not all finite sequences of sentences make sense. So by identifying 
what we call meaningful with accessible, what we just said means that compared 
with arbitrary letter combinations, the entropy of a language is extremely small. 

Something similar is true for the structures studied in science. We are used 
to thinking of the rich diversity of biological, chemical and physical structures as 
being enormous, yet relative to what one might imagine, the set of possibilities 
is highly constrained. The complete hierarchy starting from the most elementary 
building blocks of matter such as leptons and quarks, all the way up to living 
organisms, is surprisingly restricted. This has to do with the very specific nature of 
the interactions between these building blocks. To our knowledge at the microscopic 
level there are only four fundamental forces that control all interactions. At each 
new structural level (quarks, protons and neutrons, nuclei, atoms, molecules, etc) 
there is a more or less autonomous theory describing the physics at that level 
involving only the relevant degrees of freedom at that scale. Thus moving up a level 
corresponds to throwing out an enormous part of the phase space available to the 
fundamental degrees of freedom in the absence of interactions. For example, at the 
highest, most macroscopic levels of the hierarchy only the long range interactions 
(electromagnetism and gravity) play an important role - the structure of quantum 
mechanics and the details of the other two fundamental forces are more or less 
irrelevant. 

We may call the structural hierarchy we just described "coarse graining" at large. 
Although this ability to leave the details of each level behind in moving up to the 
next is essential to science, there is no cut and dried procedure that tells us how 
to do this. The only exception is that in some situations it is possible to do this 
coarse graining exactly by a procedure called renormalization [75]. This is done by
systematically studying how a set of microscopic degrees of freedom at one level can 
be averaged together to describe the degrees of freedom at the next level. There 
are some situations, such as phase transitions, where this process can then be used 
repeatedly to demonstrate the existence of fixed points of the mapping from one 
level to the next (an example of a phase transition is the change from a liquid 
to a gas). This procedure has provided important insights into the nature of phase
transitions, and in many cases it has been shown that some of their properties are 
universal, in the sense that they do not depend on the details of the microscopic 
interactions. 

5.10. Adding the entropy of subsystems. Entropy is an extensive quantity. 
Generally speaking the extensivity of entropy means that it has to satisfy the fun- 
damental linear scaling property 

$$S(T, qV, qN) = q\,S(T, V, N) , \qquad 0 < q < \infty . \tag{5.13}$$

Extensivity translates into additivity of entropies: If we combine two noninteracting
systems (labelled 1 and 2) with entropies $S_1$ and $S_2$, then the total number of states
will just be the product of those of the individual systems. Taking the logarithm,
the entropy of the total system $S$ becomes:

$$S = S_1 + S_2 . \tag{5.14}$$

Applying this to two spin systems without an external field, the number of states
of the combined system is $w = 2^{N_1 + N_2}$, i.e. $w = w_1 w_2$. Taking the logarithm
establishes the additivity of entropy.

However if we allow for a nonzero magnetic field, this result is no longer obvious. 
In Section 3.2 we calculated the number of configurations with a given energy
$E_k = -k\mu H$ as $g(N, k)$. If we now allow two systems to exchange energy but keep
the total energy fixed, then this generates a dependence between the two systems 
that lowers the total entropy. We illustrate this with an example: 

Let the number of spins pointing up in system 1 be $k_1$ and the number of particles
be $N_1$, and similarly let this be $k_2$ and $N_2$ for system 2. The total energy $k = k_1 + k_2$
is conserved, but the energy in either subsystem ($k_1$ and $k_2$) is not conserved. The
total number of spins, $N = N_1 + N_2$, is fixed, and so are the numbers of spins ($N_1$ and $N_2$)
in either subsystem. Because the systems only interact when the number of up
spins in one of them (and hence also the other one) changes, we can write the total
number of states for the combined system as

$$g(N, k) = \sum_{k_1} g_1(N_1, k_1)\, g_2(N_2, k_2) , \tag{5.15}$$

where we are taking advantage of the fact that as long as $k_1$ is fixed, systems one
and two are independent. Taking the log of the above formula clearly does not lead
to the additivity of entropies because we have to sum over $k_1$. This little calculation
illustrates the remark made before: Since we have relaxed the constraint that each 
system has a fixed energy to the condition that only the sum of their energies 
is fixed, the number of accessible states for the total system is increased. The 
subsystems themselves are no longer closed and therefore the entropy will change. 
The extensivity of entropy is recovered in the thermodynamic limit in the above
example, i.e. when $N \to \infty$. Consider the contributions to the sum in (5.15) as a
function of $k_1$, and let the value of $k_1$ where $g$ reaches a maximum be $k_1 = \hat{k}_1$. We
can now write the contribution in the sum in terms of $\delta = k_1 - \hat{k}_1$ as

$$\Delta g(N, k) = g_1(N_1, \hat{k}_1 + \delta)\, g_2(N_2, \hat{k}_2 - \delta) = f(\delta)\, g_1(N_1, \hat{k}_1)\, g_2(N_2, \hat{k}_2) , \tag{5.16}$$

where the correction factor can be calculated by expanding the $g$ functions around
their respective $\hat{k}$ values. Not surprisingly, in the limit where $N$ is large it turns
out that $f$ is on the order of $f \sim \exp(-2\delta^2)$ so that the contributions to $g(N, k)$ of
the nonmaximal terms in the sum (5.15) are exponentially suppressed. Thus in the
limit that the number of particles goes to infinity the entropy becomes additive.
This exercise shows that when a system gets large we may replace the averages of 
a quantity by its value in the most probable configuration, as our intuition would 
have suggested. From a mathematical point of view this result follows from the fact 
that the binomial distribution approaches a gaussian for large values of N, which 
becomes ever sharper as $N \to \infty$. This simple example shows that the extensivity of
entropy may or may not be true, depending on the context of the physical situation 
and in particular on the range of the inter-particle forces. 
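
The approach to additivity can be illustrated with exact integer arithmetic. The sketch below is only an illustration under stated assumptions: we take $g(N, k)$ to be the binomial coefficient, split the spins evenly between the two subsystems, and fix $k = N/2$; the helper names are hypothetical. It compares the logarithm of the largest term in the sum (5.15) with the logarithm of the full sum.

```python
import math

def g(n, k):
    """Number of configurations of n two-state spins with k pointing up."""
    return math.comb(n, k)

def combined_terms(n1, n2, k):
    """Terms g1(n1, k1) * g2(n2, k - k1) appearing in the sum over k1 in (5.15)."""
    lo, hi = max(0, k - n2), min(n1, k)
    return [g(n1, k1) * g(n2, k - k1) for k1 in range(lo, hi + 1)]

def log_ratio(n):
    """ln(largest term) / ln(full sum) for two equal subsystems with k = n/2."""
    n1 = n2 = n // 2
    terms = combined_terms(n1, n2, n // 2)
    return math.log(max(terms)) / math.log(sum(terms))

for n in (10, 100, 1000, 10000):
    print(n, log_ratio(n))
```

The printed ratio approaches one as the system grows, so the entropy of the combined system is increasingly well approximated by the sum of the subsystem entropies evaluated at the most probable partition of the energy.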

When two subsystems interact, it is certainly possible that the entropy of one 
decreases at the expense of the other. This can happen, for example, because system 
one does work on system two, so the entropy of system one goes up while that of
system two goes down. This is very important for living systems, which collect free
energy from their environment and expel heat energy as waste. Nonetheless, the
total entropy $S$ of an organism plus its environment still increases, and so does the
sum of the independent entropies of the noninteracting subsystems. That is, if at
time zero

$$S(0) = S_1(0) + S_2(0) , \tag{5.17}$$

then at time $t$ it may be true that

$$S(t) \le S_1(t) + S_2(t) . \tag{5.18}$$

This is due to the fact that only interactions with other parts of the system can 
lower the entropy of a given subsystem. In such a situation we are of course free 
to call the difference between the entropy of the individual systems and their joint 
entropy a negative correlation entropy. However, despite this apparent decrease of 
entropy, both the total entropy and the sum of the individual entropies can only 
increase, i.e. 

$$\begin{aligned} S(t) &\ge S(0) , \\ S_1(t) + S_2(t) &\ge S_1(0) + S_2(0) . \end{aligned} \tag{5.19}$$

The point here is thus that equations (5.18) and (5.19) are not in conflict.
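
To make the notion of a negative correlation entropy concrete, here is a small sketch with a hypothetical joint distribution of two correlated two-state subsystems (the numbers are illustrative assumptions, not taken from the text). It compares the joint entropy with the sum of the entropies of the two marginal distributions.

```python
import math

def entropy(p):
    """Gibbs/Shannon entropy -sum p ln p; zero-probability states contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Hypothetical joint distribution: joint[a][b] is the probability that
# subsystem 1 is in state a and subsystem 2 is in state b.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p1 = [sum(row) for row in joint]          # marginal distribution of subsystem 1
p2 = [sum(col) for col in zip(*joint)]    # marginal distribution of subsystem 2

S_joint = entropy([x for row in joint for x in row])
S_sum = entropy(p1) + entropy(p2)
correlation_entropy = S_joint - S_sum     # negative when the subsystems are correlated

# For comparison: an uncorrelated joint distribution with the same marginals.
independent = [a * b for a in p1 for b in p2]

print("S_joint            :", S_joint)
print("S_1 + S_2          :", S_sum)
print("correlation entropy:", correlation_entropy)
print("S_joint if independent:", entropy(independent))
```

For the independent product distribution the two quantities coincide, which corresponds to the situation at time zero in (5.17).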



5.11. Beyond the Boltzmann, Gibbs and Shannon entropy: the Tsallis entropy.

The equation S = k log W + const appears without an elementary theory - or 
however one wants to say it - devoid of any meaning from a phenomenological 
point of view. 

A. Einstein (1910) 

As we have already stressed, the definition of entropy as $-\sum_i p_i \log p_i$ and the
associated exponential distribution of states apply only for systems in equilibrium. 
Similarly, the requirements for an entropy function as laid out by Shannon and 
Khinchin are not the only possibilities. By modifying these assumptions there are 
other entropies that are useful. We have already mentioned the Renyi entropy, 
which has proved to be valuable to describe multi-fractals. 

Another context where considering an alternative definition of entropy appears to 
be useful concerns power laws. Power laws are ubiquitous in both natural and social 
systems. A power law is something that behaves for large $x$ as $f(x) \sim x^{-\alpha}$, with
$\alpha > 0$. Power law probability distributions decay much more slowly for large values
of $x$ than exponentials, and as a result have very different statistical properties
and are less well-behaved. Power law distributions are observed in phenomena as
diverse as the energy of cosmic rays, fluid turbulence, earthquakes, flood levels of 
rivers, the size of insurance claims, price fluctuations, the distribution of individual 
wealth, city size, firm size, government project cost overruns, film sales, and word 
usage frequencies [5H [21]. Many different models can produce power laws, but so 
far there is no unifying theory, and it is not yet clear whether any such unifying 
theory is even possible. It is clear that power laws (in energy, for instance) can't be 



Footnote: It is also possible to have a power law at zero or any other limit, and to have $\alpha < 0$, but for
our purposes here most of the examples of interest involve the limit $x \to \infty$ and positive $\alpha$.

Footnote: The $m$th moment $\int x^m p(x)\,dx$ of a power law distribution $p(x) \sim x^{-\alpha}$ does not exist when
$m > \alpha$.

explained by equilibrium statistical mechanics, where the resulting distributions are
always exponential. A common property of all the physical systems that are known 
to have power laws and the models that purport to explain them is that they are in 
some sense nonequilibrium systems. The ubiquity of power laws suggests that there 
might be nonequilibrium generalizations of statistical mechanics for which they are 
the standard probability distribution in the same way that the exponential is the 
standard in equilibrium systems. 

From simulations of model systems with long-range interactions (such as stars 
in a galaxy) or systems that remain for long periods of time at the "edge of chaos" , 
there is mounting evidence that such systems can get stuck in nonequilibrium meta- 
stable states with power law probability distributions for very long periods of time 
before they finally relax to equilibrium. Alternatively, power laws also occur in 
many driven systems that are maintained in a steady state away from equilibrium. 
Another possible area of applications is describing the behaviour of small subsys- 
tems of finite systems. 

From a purely statistical point of view it is interesting to ask what type of en- 
tropy functions are allowed. The natural assumption to alter is the last of the 
Khinchin postulates as discussed in Section 5.2. The question becomes which en- 
tropy functions satisfy the remaining two conditions, and some sensible alternative 
for the third? It turns out that there is at least one interesting class of solutions
called q-entropies introduced in 1988 by Tsallis [69] [26]. The parameter $q$ is usually
referred to as the bias or correlation parameter. For $q \ne 1$ the expression for the
q-entropy $S_q$ is

$$S_q[p] \equiv \frac{1 - \sum_i p_i^q}{q - 1} . \tag{5.20}$$

For $q = 1$, $S_q$ reduces to the standard Gibbs entropy by taking the limit as $q \to 1$.
Following Jaynes's approach to statistical mechanics, one can maximize this entropy
function under suitable constraints to obtain distribution functions that exhibit
power law behavior for $q \ne 1$. These functions are called q-exponentials and are
defined as

$$e_q(x) \equiv \begin{cases} [1 + (1 - q)x]^{1/(1-q)} & \text{if } 1 + (1 - q)x > 0 , \\ 0 & \text{otherwise} . \end{cases} \tag{5.21}$$

An important property of the q-exponential function is that for $q > 1$ and $x \ll -1$
it has a power law decay. The inverse of the q-exponential is the $\ln_q(x)$ function

$$\ln_q x \equiv \frac{x^{1-q} - 1}{1 - q} . \tag{5.22}$$

The q-exponential can also be obtained as the solution of the equation

$$\frac{dy}{dx} = y^q . \tag{5.23}$$

This is the typical behavior for a dynamical system at the edge of linear stability, 
where the first term in its Taylor series vanishes. This gives some alternative 
insight into one possible reason why such solutions may be prevalent. Other typical
situations involve long range interactions (such as the gravitational interactions 
between stars in galaxy formation) or nonlinear generalizations of the central limit 
theorem [72] for variables with strong correlations. 
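
The properties of the q-exponential quoted above are easy to check numerically. The following sketch uses our own function names `exp_q` and `ln_q` (they are not from any library) and arbitrarily chosen test values. It verifies that $\ln_q$ inverts $e_q$, that the ordinary exponential is recovered as $q \to 1$, that the tail decays as a power law for $q > 1$, and that $e_q$ satisfies $dy/dx = y^q$ (checked here by a central finite difference).

```python
import math

def exp_q(x, q):
    """q-exponential as in (5.21); equals exp(x) for q = 1."""
    if q == 1:
        return math.exp(x)
    base = 1 + (1 - q) * x
    return base ** (1 / (1 - q)) if base > 0 else 0.0

def ln_q(x, q):
    """q-logarithm as in (5.22), for x > 0; equals ln(x) for q = 1."""
    if q == 1:
        return math.log(x)
    return (x ** (1 - q) - 1) / (1 - q)

# Inverse property at an arbitrary point.
print(ln_q(exp_q(0.7, 1.5), 1.5))

# Limit q -> 1 recovers the ordinary exponential.
print(exp_q(1.0, 1 + 1e-6), math.e)

# Power law tail for q = 2: exp_q(x) = 1 / (1 - x), compared with exp(x).
for x in (-10.0, -100.0, -1000.0):
    print(x, exp_q(x, 2.0), math.exp(x))

# Finite-difference check of dy/dx = y**q at x = -3 for q = 2.
h = 1e-6
derivative = (exp_q(-3 + h, 2.0) - exp_q(-3 - h, 2.0)) / (2 * h)
print(derivative, exp_q(-3, 2.0) ** 2)
```

For $q = 2$ the tail falls off only as $1/|x|$, whereas the ordinary exponential is already negligible at $x = -100$.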
At first sight a problem with q-entropies is that for $q \ne 1$ they are not additive.
In fact the following equality holds:

$$S_q[p^{(1)} p^{(2)}] = S_q[p^{(1)}] + S_q[p^{(2)}] + (1 - q)\, S_q[p^{(1)}]\, S_q[p^{(2)}] \tag{5.24}$$

with the corresponding product rule for the q-exponentials:

$$e_q(x)\, e_q(y) = e_q(x + y + (1 - q)xy) . \tag{5.25}$$

This is why the q-entropy is often referred to as a non-extensive entropy. However,
this is in fact a blessing in disguise. If the appropriate type of scale invariant correlations
between subsystems are typical, then the q-entropies for $q \ne 1$ are strictly
additive. When there are sufficiently long-range interactions Shannon entropy is
not extensive; Tsallis entropy provides a substitute that is additive (under the
right class of long-range interactions), thereby capturing an underlying regularity
with a simple description.
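
The composition rule (5.24) for independent subsystems can be verified directly. In the sketch below the two distributions and the value $q = 1.7$ are arbitrary illustrative choices and `tsallis_entropy` is our own helper name. The same helper is used to check that the Gibbs entropy is recovered as $q \to 1$.

```python
import math

def tsallis_entropy(p, q):
    """q-entropy (1 - sum p_i**q) / (q - 1); Gibbs entropy for q = 1."""
    if q == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1 - sum(x ** q for x in p if x > 0)) / (q - 1)

# Two hypothetical independent subsystems and their product distribution.
p1 = [0.5, 0.3, 0.2]
p2 = [0.7, 0.3]
joint = [a * b for a in p1 for b in p2]

q = 1.7
s1 = tsallis_entropy(p1, q)
s2 = tsallis_entropy(p2, q)
lhs = tsallis_entropy(joint, q)
rhs = s1 + s2 + (1 - q) * s1 * s2

print("S_q of product distribution:", lhs)
print("pseudo-additive rule       :", rhs)
print("q -> 1 limit:", tsallis_entropy(p1, 1 + 1e-6), tsallis_entropy(p1, 1))
```

For $q > 1$ the cross term is negative, so the q-entropy of two independent subsystems is smaller than the sum of the parts; for $q < 1$ it is larger.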

This alternative statistical mechanical theory involves another convenient defi- 
nition which makes the whole formalism look like the "old" one. Motivated by the 
fact that the Tsallis entropy weights all probabilities according to $p_i^q$, it is possible
to define an "escort" distribution

$$P_i^{(q)} = \frac{p_i^q}{\sum_j p_j^q} , \tag{5.26}$$

as introduced by Beck [B]. One can then define the corresponding expectation 
values of a variable A in terms of the escort distribution as 

$$\langle A \rangle_q = \sum_i P_i^{(q)} A_i . \tag{5.27}$$

With these definitions the whole formalism runs parallel to the Boltzmann-Gibbs 
program. 
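
As a brief illustration of how the escort distribution reweights probabilities, consider the following sketch. The distribution, the observable values and the function names are illustrative assumptions only.

```python
def escort(p, q):
    """Escort distribution: p_i**q normalized to sum to one."""
    weights = [x ** q for x in p]
    z = sum(weights)
    return [w / z for w in weights]

def q_expectation(values, p, q):
    """Expectation of an observable taken with respect to the escort distribution."""
    return sum(P * a for P, a in zip(escort(p, q), values))

# Hypothetical three-state system with an observable A taking values 0, 1, 2.
p = [0.7, 0.2, 0.1]
A = [0.0, 1.0, 2.0]

for q in (0.5, 1.0, 2.0):
    print(q, escort(p, q), q_expectation(A, p, q))
```

For $q > 1$ the escort distribution emphasizes the most probable states, for $q < 1$ it emphasizes the rare ones, and for $q = 1$ it reduces to the original distribution and the ordinary expectation value.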

One can of course ask what the Tsallis entropy "means". The entropy S q is 
a measure of lack of information along the same lines as the Boltzmann-Gibbs- 
Shannon entropy is. In particular, perfect knowledge of the microscopic state of 
the system yields $S_q = 0$, and maximal uncertainty (i.e., all $W$ possible microscopic
states are equally probable) yields maximal entropy, $S_q = \ln_q W$. The question
remains how generic such correlations are and which physical systems exhibit them, 
though at this point quite a lot of empirical evidence is accumulating to suggest that 
such functions are at least a good approximation in many situations. In addition 
recent results have shown that q-exponentials obey a central limit-like behavior for 
combining random variables with appropriate long-range correlations. 

A central question is what determines $q$? There is a class of natural, artificial
and social systems for which it is possible to choose a unique value of $q$ such that
the entropy is simultaneously extensive (i.e., $S_q(N)$ proportional to the number of
elements $N$, $N \gg 1$) and there is finite entropy production per unit time (i.e., $S_q(t)$
proportional to time $t$, $t \gg 1$) [71] [70]. It is possible to acquire some intuition about
the nature and meaning of the index $q$ through the following analogy: If we consider
an idealized planar surface, it is only its $d = 2$ Lebesgue measure which is finite; the
measure for any $d > 2$ vanishes, and that for any $d < 2$ diverges. If we have a fractal
system, only the $d = d_f$ measure is finite, where $d_f$ is the Hausdorff dimension; any
$d > d_f$ measure vanishes, and any $d < d_f$ measure diverges. Analogously, only for
a special value of $q$ does the entropy $S_q$ match the thermodynamical requirement of
extensivity and the equally physical requirement of finite entropy production. The
value of q reflects the geometry of the measure in phase space on which probability 
is concentrated. 

Values of q differing from unity are consistent with the recent q-generalization 
of the Central Limit Theorem and the alpha-stable (Levy) distributions. Indeed, if 
instead of adding a large number of exactly or nearly independent random variables, 
we add globally correlated random variables, the attractors shift from Gaussians 
and Levy distributions to q-Gaussian and $(q, \alpha)$-stable distributions respectively

I531E21IZ3]- 

The framework described above is still in development. It may turn out to be 
relevant to 'statistical mechanics' not only in nonequilibrium physics, but also in 
quite different arenas, such as economics. 

6. Quantum information 

Until recently, most people thought of quantum mechanics in terms of the 
uncertainty principle and unavoidable limitations on measurement. Einstein 
and Schrodinger understood early on the importance of entanglement, but 
most people failed to notice, thinking of the EPR paradox as a question for 
philosophers. The appreciation of the positive application of quantum effects 
to information processing grew slowly. 

Nicolas Gisin 

Quantum mechanics provides a fundamentally different means of computing, 
and potentially makes it possible to solve problems that would be intractable on 
classical computers. For example, with a classical computer the typical time it 
takes to factor a number grows exponentially with the size of the number, but 
using quantum computation Shor has shown that this can be done in polynomial 
time [55]. Factorization is one of the main tools in cryptography, so this is not 
just a matter of academic interest. To see the huge importance of exponential vs. 
polynomial scaling, suppose an elementary computational step takes $\Delta t$ seconds. If
the number of steps increases exponentially, factorizing a number with $N$ digits will
take $\Delta t \exp(aN)$ seconds, where $a$ is a constant that depends on the details of the
algorithm. For example, if $\Delta t = 10^{-6}$ and $a = 10^{-2}$, factoring a number with
$N = 10{,}000$ digits will take $10^{37}$ seconds, which is much, much longer than the lifetime
of the universe (which is a mere $4.6 \times 10^{17}$ seconds). In contrast, if the number
of steps scales as the third power of the number of digits, the same computation
takes $a' \Delta t N^3$ seconds, which with $a' = 10^{-2}$, is $10^4$ seconds or a little under three
hours. Of course the constants $a$, $a'$ and $\Delta t$ are implementation dependent, but
because of the dramatic difference between exponential vs. polynomial scaling, for
sufficiently large $N$ there is always a fundamental difference in speed. In fact for
the factoring problem as such, the situation is more subtle: at present the best
available classical algorithm requires $\exp(O(n^{1/3} \log^{2/3} n))$ operations, whereas the
best available quantum algorithm would require $O(n^2 \log n \log \log n)$ operations.
Factorization is only one of several problems that could potentially benefit from 
quantum computing. The implications go beyond quantum computing, and include 
diverse applications such as quantum cryptography and quantum communication 

[551 HQl G2J US]. 
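
The arithmetic behind the comparison of exponential and polynomial scaling can be reproduced in a few lines. The constants below are the illustrative values quoted in the text; the variable names are our own.

```python
import math

dt = 1e-6               # seconds per elementary step (value quoted in the text)
a = 1e-2                # constant in the exponential scaling (value quoted in the text)
a_prime = 1e-2          # constant in the cubic scaling (value quoted in the text)
n_digits = 10_000       # number of digits of the number to be factored
age_of_universe = 4.6e17  # seconds, the figure quoted in the text

t_exponential = dt * math.exp(a * n_digits)
t_polynomial = a_prime * dt * n_digits ** 3

print("exponential scaling: %.2e seconds" % t_exponential)
print("polynomial scaling : %.2e seconds (%.2f hours)" % (t_polynomial, t_polynomial / 3600))
print("exponential time in units of the age of the universe: %.2e"
      % (t_exponential / age_of_universe))
```

With these values the exponential estimate is of order $10^{37}$ seconds, while the cubic estimate is $10^4$ seconds, a little under three hours.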

The possibility for such huge speed-ups comes from the intrinsically parallel 
nature of quantum systems. The reasons for this are sufficiently subtle that it took 
many decades after the discovery of quantum mechanics before anyone realized
that its computational properties are fundamentally different. The huge interest in 
quantum computation in recent years has caused a re-examination of the concept 
of information in physical systems, spawning a field that is sometimes referred to 
as "quantum information theory" . 

Before entering the specifics of quantum information and computing, we give a 
brief introduction to the basic setting of quantum theory and contrast it with its 
classical counterpart. We describe the physical states of a quantum system, the
definition of quantum observables, and time evolution according to the Schrödinger
equation. Then we briefly explain the measurement process, the basics of quantum
teleportation and quantum computation. To connect to classical statistical
physics we describe the density matrix and the von Neumann entropy. Quantum 
computation in practice involves sophisticated and highly specialized subfields of 
experimental physics which are beyond the scope of this brief review - we have 
tried to limit the discussion to the essential principles. 

6.1. Quantum states and the definition of a qubit. In classical physics we 
describe the state of a system by specifying the values of dynamical variables, for 
example, the position and velocity of a particle at a given instant in time. The time 
evolution is then described by Newton's laws, and any uncertainty in its evolution
is driven by the accuracy of the measurements. As we described in Section 4.2
uncertainties can be amplified by chaotic dynamics, but within classical physics 
there is no fundamental limit on the accuracy of measurements - by measuring 
more and more carefully, we can predict the time evolution of a system more and 
more accurately. At a fundamental level, however, all of physics behaves according 
to the laws of quantum mechanics, which are very different from the laws of classical 
physics. At the macroscopic scales of space, time and energy where classical physics 
is a good approximation, the predictions of classical and quantum theories have 
to be roughly the same, a statement that is called the correspondence principle. 
Nonetheless, understanding the emergence of classical physics from an underlying 
quantum description is not always easy. 

The scale of the quantum regime is set by Planck's constant, which has dimen- 
sions of energy x time (or equivalently momentum x length). It is extremely small 
in ordinary units: $\hbar = 1.05 \times 10^{-34}$ joule-seconds. This is why quantum properties
only manifest themselves at very small scales or very low temperatures. One has to 
keep in mind however, that radically different properties at a microscopic scale (say 
at the level of atomic and molecular structure) will also lead to fundamentally dif- 
ferent collective behavior on a macroscopic scale. Most phases of condensed matter 
realized in nature, such as crystals, super, ordinary or semi-conductors or magnetic 
materials, can only be understood from the quantum mechanical perspective. The 
stability and structure of matter is to a large extent a consequence of the quantum 
behavior of its fundamental constituents. 

To explain the basic ideas of quantum information theory we will restrict our 
attention to systems of qubits, which can be viewed as the basic building blocks of 
quantum information systems. The physical state of a quantum system is described 
by a wavefunction that can be thought of as a vector in an abstract multidimensional
space, called a Hilbert space. For our purposes here, this is just a finite dimensional

Footnote: We are using the reduced Planck's constant, $\hbar = h/2\pi$.

vector space where the vectors have complex rather than real coefficients, and where
the length of a vector is the usual length in such a space, i.e. the square root of
the sum of the square amplitudes of its components. Hilbert space replaces the
concept of phase space in classical mechanics. Orthogonal basis vectors defining 
the axes of the space correspond to different values of measurable quantities, also 
called observables, such as spin, position, or momentum. 

As we will see, an important difference from classical mechanics is that many 
quantum mechanical quantities, such as position and momentum or spin along the 
x-axis and spin along the y-axis, cannot be measured simultaneously. Another
essential difference from classical physics is that the dimensionality of the state 
space of the quantum system is huge compared to that of the classical phase space. 
To illustrate this drastic difference think of a particle that can move along an infinite 
line with an arbitrary momentum. From the classical perspective it has a phase 
space that is two dimensional and real (a position x and a momentum p ), but 
from the quantum point of view it is given by a wavefunction $\psi$ of one variable
(typically the position x or the momentum p) . This wave function corresponds to 
an element in an infinite dimensional Hilbert space. 

We discussed the classical Ising spin in Section 3.2. It is a system with only two
states, denoted by $s = \pm 1$, called spin up or spin down, which can be thought of
as representing a classical bit with two possible states, "0" and "1". The quantum 
analog of the Ising spin is a very different kind of animal. Where the Ising spin 
corresponds to a classical bit, the quantum spin corresponds to what is called a 
qubit. As we will make clear in a moment, the state space of a qubit is much 
larger than that of its classical counterpart, making it possible to store much more
information. This is only true in a certain sense, as one has to take into account to 
what extent the state is truly observable and whether it can be precisely prepared, 
questions we will return to later. 

Any well-defined two level quantum system can be thought of as representing 
a qubit. Examples of two state quantum systems are a photon, which possesses 
two polarization states, an electron, which possesses two possible spin states, or a 
particle in one of two possible energy states. In the first two examples the physical 
quantities in the Hilbert space are literally spins, corresponding to angular momentum,
but in the last example this is not the case. This doesn't matter - even if
the underlying quantities have nothing to do with angular momentum, as long as
it is a two state quantum system we can refer to it as a "spin". We can arbitrarily
designate one quantum state as "spin up", represented by the symbol $|1\rangle$, and the
other "spin down", represented by the symbol $|0\rangle$.

The state of a qubit is described by a wavefunction or state vector $|\psi\rangle$, which
can be written as

$$|\psi\rangle = \alpha|1\rangle + \beta|0\rangle \quad \text{with} \quad |\alpha|^2 + |\beta|^2 = 1 . \tag{6.1}$$

Here $\alpha$ and $\beta$ are complex numbers, and thus we can think of $|\psi\rangle$ as a vector in the
2-dimensional complex vector space, denoted $\mathbb{C}^2$, and we can represent the state

More generally it is necessary to allow for the possibility of infinite dimensions, which introduces
complications about the convergence of series that we do not need to worry about here.

A complex number $\alpha$ has a real and an imaginary part, $\alpha = a_1 + i a_2$, where $a_1$ and $a_2$ are both
real, and $i$ is the imaginary unit with the property $i^2 = -1$. Note that a complex number can
therefore also be thought of as a vector in a two dimensional real space. The complex conjugate is
defined as $\alpha^* = a_1 - i a_2$ and the square of the modulus, or absolute value, as $|\alpha|^2 = \alpha^* \alpha = a_1^2 + a_2^2$.
as a column vector $\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$. We can also define a dual vector space in $\mathbb{C}^2$ with dual
vectors that can either be represented as row vectors or alternatively be written

$$\langle\psi| = \langle 1|\alpha^* + \langle 0|\beta^*. \tag{6.2}$$

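This allows us to define the inner product between two state vectors $|\psi\rangle$ and $|\phi\rangle = \gamma|1\rangle + \delta|0\rangle$ as

$$\langle\phi|\psi\rangle = \langle\psi|\phi\rangle^* = \gamma^*\alpha + \delta^*\beta. \tag{6.3}$$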

Each additional state (or configuration) in the classical system yields an additional 
orthogonal dimension (complex parameter) in the quantum system. Hence a finite 
state classical system will lead to a finite dimensional complex vector space for the 
corresponding quantum system. 

Let us describe the geometry of the quantum configuration space of a single qubit 
in more detail. The constraint $|\alpha|^2 + |\beta|^2 = 1$ says that the state vector has unit
length, which defines the complex unit circle in $\mathbb{C}^2$, but if we write the complex
numbers in terms of their real and imaginary parts as $\alpha = a_1 + i a_2$ and $\beta = b_1 + i b_2$,
then we obtain $|a_1 + i a_2|^2 + |b_1 + i b_2|^2 = a_1^2 + a_2^2 + b_1^2 + b_2^2 = 1$. The geometry of
the space described by the latter equation is just the three dimensional unit sphere
$S^3$ embedded in a four dimensional Euclidean space, $\mathbb{R}^4$.
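As a small numerical check of the normalization constraint and of the inner product (6.3), the following Python sketch stores a qubit as the pair of complex amplitudes (alpha, beta). The helper names `inner` and `normalize` and the sample amplitudes are illustrative assumptions, not part of the formalism itself.

```python
import math

def inner(phi, psi):
    """Inner product <phi|psi>: conjugate the components of phi, then sum."""
    return sum(complex(p).conjugate() * q for p, q in zip(phi, psi))

def normalize(state):
    """Rescale a nonzero vector to unit length."""
    norm = math.sqrt(inner(state, state).real)
    return [c / norm for c in state]

# Components are ordered (alpha, beta): coefficient of |1> first, as in (6.1).
psi = normalize([1 + 1j, 2 - 1j])   # arbitrary sample amplitudes
phi = normalize([1j, 1])

alpha, beta = psi
a1, a2, b1, b2 = alpha.real, alpha.imag, beta.real, beta.imag

norm_sq = abs(alpha) ** 2 + abs(beta) ** 2            # |alpha|^2 + |beta|^2
s3_radius_sq = a1 ** 2 + a2 ** 2 + b1 ** 2 + b2 ** 2  # same constraint in R^4
```

Both `norm_sq` and `s3_radius_sq` come out equal to one up to rounding, which is the statement that a normalized qubit state is a point on the unit sphere in four real dimensions.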

To do any nontrivial quantum computation we need to consider a system with
multiple qubits. Physically it is easiest to imagine a system of n particles, each with
its own spin. (As before, the formalism does not depend on this, and it is possible
to have examples in which the individual qubits might correspond to other physical
properties.) The mathematical space in which the n qubits live is the tensor product
of the individual qubit spaces, which we may write as $\mathbb{C}^2 \otimes \mathbb{C}^2 \otimes \cdots \otimes \mathbb{C}^2 = \mathbb{C}^{2^n}$. For
example, the Hilbert space for two qubits is $\mathbb{C}^2 \otimes \mathbb{C}^2$. This is a four dimensional
complex vector space spanned by the vectors $|1\rangle \otimes |1\rangle$, $|0\rangle \otimes |1\rangle$, $|1\rangle \otimes |0\rangle$, and
$|0\rangle \otimes |0\rangle$. For convenience we will often abbreviate the tensor product by omitting
the tensor product symbols, or by simply listing the spins. For example

$$|1\rangle \otimes |0\rangle = |1\rangle|0\rangle = |10\rangle.$$

The tensor product of two qubits with wave functions $|\psi\rangle = \alpha|1\rangle + \beta|0\rangle$ and $|\phi\rangle = \gamma|1\rangle + \delta|0\rangle$ is

$$|\psi\rangle \otimes |\phi\rangle = |\psi\rangle|\phi\rangle = \alpha\gamma|11\rangle + \alpha\delta|10\rangle + \beta\gamma|01\rangle + \beta\delta|00\rangle.$$

The most important feature of the tensor product is that it is multi-linear, i.e.
$(\alpha|0\rangle + \beta|1\rangle) \otimes |\psi\rangle = \alpha|0\rangle \otimes |\psi\rangle + \beta|1\rangle \otimes |\psi\rangle$. Again we emphasize that whereas the
classical n-bit system has $2^n$ states, the n-qubit system corresponds to a vector
of unit length in a $2^n$ dimensional complex space, with twice as many degrees of
freedom. For example a three-qubit state can be expanded as:

$$|\psi\rangle = \alpha_1|000\rangle + \alpha_2|001\rangle + \alpha_3|010\rangle + \alpha_4|011\rangle + \alpha_5|100\rangle + \alpha_6|101\rangle + \alpha_7|110\rangle + \alpha_8|111\rangle$$

Sometimes it is convenient to denote the state vector by the column vector of its
components $\alpha_1, \alpha_2, \ldots, \alpha_8$.
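The tensor product of state vectors can be sketched in a few lines of Python. The function names `kron` and `n_qubit_state`, the sample amplitudes, and the choice of listing the $|1\rangle$ component first are assumptions made for illustration only.

```python
def kron(u, v):
    """Tensor (Kronecker) product of two state vectors stored as lists."""
    return [a * b for a in u for b in v]

# Assumed ordering: index 0 is the |1> component, index 1 the |0> component.
ONE, ZERO = [1, 0], [0, 1]

alpha, beta = 0.6, 0.8j                     # illustrative amplitudes for |psi>
gamma, delta = 2 ** -0.5, 1j * 2 ** -0.5    # illustrative amplitudes for |phi>
psi, phi = [alpha, beta], [gamma, delta]

# Components of |psi>|phi> along |11>, |10>, |01>, |00>, in that order.
product = kron(psi, phi)
expected = [alpha * gamma, alpha * delta, beta * gamma, beta * delta]

# |1> tensor |0> = |10> is the second basis vector in this ordering.
ket_10 = kron(ONE, ZERO)

def n_qubit_state(single_qubit_states):
    """Tensor together a list of single-qubit states."""
    state = [1]
    for q in single_qubit_states:
        state = kron(state, q)
    return state

three_qubits = n_qubit_state([psi, phi, ONE])   # 2**3 = 8 components
```

The length of the list doubles with every added qubit, which is the exponential growth of the state space described in the text.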
6.2. Observables. How are ordinary physical variables such as energy, position,
velocity and spin retrieved from the state vector? In the quantum formalism observables
are defined as hermitian operators acting on the state space. In quantum
mechanics an operator is a linear transformation that maps one state into
another, which, provided the state space is finite dimensional, can be represented
by a matrix. A hermitian operator or matrix satisfies the condition $A = A^\dagger$, where
$A^\dagger = (A^{tr})^*$ is the complex conjugate of the transpose of A. The fact that observables
are represented by operators reflects the property that measurements may
alter the state and that outcomes of different measurements may depend on the
order in which the measurements are performed. In general observables in quantum
mechanics do not necessarily commute, by which we mean that for the product
of two observables A and B one may have that $AB \neq BA$. The reason that observables
have to be hermitian is that the outcomes of measurements are the
eigenvalues of observables, and hermitian operators are guaranteed to have real
eigenvalues.

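For example consider a single qubit. The physical observables are the components
of the spin along the x, y or z directions, which are by convention written
$s_x = \frac{1}{2}\sigma_x$, $s_y = \frac{1}{2}\sigma_y$, etc. The operators $\sigma$ are the Pauli matrices

$$\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \tag{6.4}$$

which obviously do not commute. In writing the spin operators this way we have
arbitrarily chosen the z-axis to have a diagonal representation, so that the eigenstates
for spin along the z axis are the column matrices

$$|1\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad |0\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.$$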
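The claims that the Pauli matrices of (6.4) are hermitian and do not commute can be checked directly. The sketch below uses plain nested lists; the helper names `matmul`, `dagger`, `commutator` and `is_hermitian` are hypothetical and chosen only for this illustration.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Hermitian conjugate: transpose, then complex-conjugate each entry."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def commutator(a, b):
    """AB - BA."""
    ab, ba = matmul(a, b), matmul(b, a)
    n = len(a)
    return [[ab[i][j] - ba[i][j] for j in range(n)] for i in range(n)]

def is_hermitian(a):
    d = dagger(a)
    n = len(a)
    return all(abs(a[i][j] - d[i][j]) < 1e-12 for i in range(n) for j in range(n))

SIGMA_X = [[0, 1], [1, 0]]
SIGMA_Y = [[0, -1j], [1j, 0]]
SIGMA_Z = [[1, 0], [0, -1]]

# sigma_x sigma_y - sigma_y sigma_x equals 2i * sigma_z, so it is nonzero.
xy = commutator(SIGMA_X, SIGMA_Y)
```

A nonzero commutator is exactly the statement $AB \neq BA$, so the order in which spin components along x and y are measured matters.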




6.3. Quantum evolution: the Schrödinger equation. The wave function of
a quantum system evolves in time according to the famous Schrödinger equation.
Dynamical changes in a physical system are induced by the underlying forces acting
on the system and between its constituent parts, and their effect can be represented
in terms of what is called the energy or Hamiltonian operator H. For a single qubit
system the operators can be represented as 2 x 2 matrices, for a two qubit system
they are 4 x 4 matrices, etc. The Schrödinger equation can be written

$$i\hbar \frac{\partial |\psi(t)\rangle}{\partial t} = H|\psi(t)\rangle. \tag{6.5}$$

This is a linear differential equation expressing the property that the time evolution
of a quantum system is generated by its energy operator. Assuming that H is
constant, given an initial state $|\psi(0)\rangle$ the solution is simply

$$|\psi(t)\rangle = U(t)|\psi(0)\rangle \quad \text{with} \quad U(t) = e^{-iHt/\hbar}. \tag{6.6}$$



We can rotate into a different representation that makes either of the other two axes diagonal,
and in which the z-axis is no longer diagonal; it is only possible to make one of the three axes
diagonal at a time. Experimental set-ups often have conditions that break symmetry, such as an
applied magnetic field, in which case it is most convenient to let the symmetry breaking direction
be the z-axis.

21 The eigenstates $|\chi_k\rangle$ of a linear operator A are defined by the equation $A|\chi_k\rangle = \lambda_k|\chi_k\rangle$. If A
is hermitian the eigenvalue $\lambda_k$ is a real number. It is generally possible to choose the eigenstates
as orthonormal, so that $\langle\chi_j|\chi_k\rangle = \delta_{jk}$, where $\delta_{ij} = 1$ when $i = j$ and $\delta_{ij} = 0$ otherwise.

The time evolution is unitary, meaning that the operator U(t) satisfies $UU^\dagger = 1$:

$$U^\dagger = \exp(-iHt/\hbar)^\dagger = \exp(iH^\dagger t/\hbar) = \exp(iHt/\hbar) = U^{-1}. \tag{6.7}$$

Unitary time evolution means that the length of the state vector remains invariant,
which is necessary to preserve the total probability for the system to be in any of
its possible states. The unitary nature of the time evolution operator U follows
directly from the fact that H is hermitian. Any hermitian 2 x 2 matrix can be
written



$$H = \begin{pmatrix} a & b - ic \\ b + ic & -a \end{pmatrix},$$

where a, b and c are real numbers.

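For the simple example of a single qubit, suppose the initial state is

$$|\psi(0)\rangle = \sqrt{\tfrac{1}{2}}\,(|1\rangle + |0\rangle) = \sqrt{\tfrac{1}{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

On the right, for the sake of convenience, we have written the state as a column
vector. Consider the energy of a spin in a magnetic field B directed along the
positive z-axis. In this case H is given by $H = B s_z$. From (6.4)

$$H = \frac{B}{2} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$

Using (6.6) we obtain an oscillatory time dependence for the state, i.e.

$$|\psi(t)\rangle = \frac{1}{\sqrt{2}} \begin{pmatrix} e^{-iBt/2\hbar} \\ e^{iBt/2\hbar} \end{pmatrix} = \frac{1}{\sqrt{2}} \left[ \cos\frac{Bt}{2\hbar} \begin{pmatrix} 1 \\ 1 \end{pmatrix} + i \sin\frac{Bt}{2\hbar} \begin{pmatrix} -1 \\ 1 \end{pmatrix} \right]. \tag{6.10}$$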



We thus see that, in contrast to classical mechanics, time evolution in quantum 
mechanics is always linear. It is in this sense much simpler than classical mechanics. 
The complication is that when we consider more complicated examples, for example 
corresponding to a macroscopic object such as a planet, the dimension of the space 
in which the quantum dynamics takes place becomes extremely high. 
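The single-qubit example of this section can be reproduced numerically. The sketch below is a minimal illustration under stated assumptions: units with $\hbar = 1$, an arbitrary field strength B, and hypothetical function names. Because $H = B s_z$ is diagonal, the matrix exponential in (6.6) reduces to a phase factor on each component, so no general matrix exponential routine is needed.

```python
import cmath
import math

HBAR = 1.0   # assumption: units in which hbar = 1
B = 2.0      # assumption: illustrative field strength

def evolve(psi0, t):
    """Apply U(t) = exp(-i H t / hbar) for H = B s_z = (B/2) diag(1, -1).

    H is diagonal here, so the exponential acts as a phase on each component."""
    theta = B * t / (2 * HBAR)
    return [cmath.exp(-1j * theta) * psi0[0], cmath.exp(1j * theta) * psi0[1]]

def closed_form(t):
    """Right-hand side of (6.10): cos(theta)(1, 1) + i sin(theta)(-1, 1), over sqrt(2)."""
    theta = B * t / (2 * HBAR)
    c, s = math.cos(theta), math.sin(theta)
    return [(c - 1j * s) / math.sqrt(2), (c + 1j * s) / math.sqrt(2)]

def norm_squared(psi):
    return sum(abs(c) ** 2 for c in psi)

def expectation_sx(psi):
    """<psi| s_x |psi> with s_x = sigma_x / 2."""
    return (psi[0].conjugate() * psi[1] + psi[1].conjugate() * psi[0]).real / 2

psi0 = [1 / math.sqrt(2), 1 / math.sqrt(2)]
times = [0.0, 0.3, 1.0, 2.5]
states = [evolve(psi0, t) for t in times]
```

The norm of each evolved state stays equal to one, as unitarity requires, while the expectation of $s_x$ oscillates as $\frac{1}{2}\cos(Bt/\hbar)$: the spin precesses about the field direction.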

6.4. Quantum measurements. Measurement in classical physics is conceptually 
trivial: One simply estimates the value of the classical state at finite precision and 
approximates the state as a real number with a finite number of digits. The accu- 
racy of measurements is limited only by background noise and the precision of the 
measuring instrument. The measurement process in quantum mechanics, in con- 
trast, is not at all trivial. One difference with classical mechanics is that in many 
instances the set of measurable states is discrete, with quantized values for the ob- 
servables. It is this property that has given the theory of quantum mechanics its 
name. But perhaps an even more profound difference is that quantum measurement 
typically causes a radical alteration of the wavefunction. Before the measurement 
of an observable we can only describe the possible outcomes in terms of probabili- 
ties, whereas after the measurement the outcome is known with certainty, and the 
wavefunction is irrevocably altered to reflect this. In the conventional Copenhagen 
interpretation of quantum mechanics the wave function is said to "collapse" when a 
measurement is made. In spite of the fact that quantum mechanics makes spectac- 
ularly successful predictions, the fact that quantum measurements are inherently 



22 We omitted a component proportional to the unit matrix as it acts trivially on any state. 
23 Quantum spins necessarily have a magnetic moment, so in addition to carrying angular
momentum they also interact with a magnetic field.

probabilistic and can "instantly" alter the state of the system has caused a great
deal of controversy. In fact, one can argue that historically the field of quantum
computation emerged from thinking carefully about the measurement problem [18].

In the formalism of quantum mechanics the possible outcomes of an observable 
quantity A are given by the eigenvalues of the matrix A. For example, the three 
spin operators defined in Eq. 6.4 all have the same two eigenvalues $\lambda_\pm = \pm 1/2$. This
means that the possible outcomes of a measurement of the spin in any direction can
only be plus or minus one half. This is completely different from a spinning object
in classical physics, which can spin at any possible rate in any direction. This is 
why quantum mechanics is so nonintuitive! 

If a quantum system is in an eigenstate then the outcome of measurements in the
corresponding direction is certain. For example, imagine we have a qubit in the state
with $\alpha = 1$ and $\beta = 0$, so $|\psi\rangle = |1\rangle$. It is then in the eigenstate of $s_z$ with eigenvalue
$+\frac{1}{2}$, so the measurement of $s_z$ will always yield that value. This is reflected in
the mathematical machinery of quantum mechanics by the fact that for the spin
operator in the z-direction, $A = s_z$, the eigenvector with eigenvalue $\lambda_+ = +1/2$
is $|1\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and the eigenvector with $\lambda_- = -1/2$ is $|0\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$. In contrast, if
we make measurements in the orthogonal directions to the eigenstate, e.g. $A = s_x$,
the outcomes become probabilistic. In the example above the eigenvectors of $s_x$
are $|\chi_+\rangle = \sqrt{\tfrac{1}{2}}(|1\rangle + |0\rangle)$ and $|\chi_-\rangle = \sqrt{\tfrac{1}{2}}(|1\rangle - |0\rangle)$. In general the probability of
finding the system in a given state in a measurement is computed by first expanding
the given state into the eigenstates $|\chi_k\rangle$ of the matrix A corresponding to the
observable, i.e.

$$|\psi\rangle = \sum_k \alpha_k |\chi_k\rangle \quad \text{where} \quad \alpha_k = \langle\chi_k|\psi\rangle. \tag{6.11}$$

The probability of measuring the system in the state corresponding to eigenvalue
$\lambda_k$ is $p_k = |\alpha_k|^2$. The predictions of quantum mechanics are therefore probabilistic,
but the theory is essentially different from classical probability theory. On the
one hand it is clear that a given operator defines a probability measure on Hilbert
space; however, as the operators are non-commuting (like matrices) one is dealing
with a non-commutative probability theory [34]. It is the non-commutativity of
observables that gives rise to the intricacies in the quantum theory of measurement.

Let us discuss an example for clarification. Consider the spin in the x-direction,
$A = s_x$, and $|\psi\rangle = |1\rangle$, i.e. spin up in the z-direction. Expanding in eigenstates
of $s_x$ we get $|\psi\rangle = |1\rangle = \sqrt{\tfrac{1}{2}}|\chi_+\rangle + \sqrt{\tfrac{1}{2}}|\chi_-\rangle$. The probability of measuring spin
up along the x-direction is $|\alpha_+|^2 = 1/2$, and the probability of measuring spin
down along the x-direction is $|\alpha_-|^2 = 1/2$. We see how probability enters quantum
mechanics at a fundamental level. The average of an observable is its expectation
value, which is the weighted sum

$$\langle\psi|A|\psi\rangle = \sum_k |\alpha_k|^2 \lambda_k = \sum_k p_k \lambda_k. \tag{6.12}$$

In the example at hand $\langle s_x \rangle = 0$.
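The recipe of (6.11) and (6.12) translates almost line by line into code. The following Python sketch measures $s_x$ on two different states; the names `inner`, `measurement_statistics` and `SX_EIGENSYSTEM` are assumptions introduced for the example.

```python
import math

def inner(phi, psi):
    """Inner product <phi|psi>."""
    return sum(complex(a).conjugate() * b for a, b in zip(phi, psi))

R = 1 / math.sqrt(2)
# Eigenvalues and eigenvectors of s_x, with components ordered (|1>, |0>).
SX_EIGENSYSTEM = [(+0.5, [R, R]), (-0.5, [R, -R])]

def measurement_statistics(psi, eigensystem):
    """Return the probabilities p_k = |<chi_k|psi>|^2 and the expectation value."""
    probabilities = [abs(inner(chi, psi)) ** 2 for _, chi in eigensystem]
    expectation = sum(p * lam for p, (lam, _) in zip(probabilities, eigensystem))
    return probabilities, expectation

# |psi> = |1>: spin up along z, measured along x.
probs_up, mean_up = measurement_statistics([1, 0], SX_EIGENSYSTEM)

# |psi> = |chi_+>: already an eigenstate of s_x, so the outcome is certain.
probs_plus, mean_plus = measurement_statistics([R, R], SX_EIGENSYSTEM)
```

For the first state both outcomes have probability one half and the expectation vanishes; for the second the outcome $+1/2$ occurs with certainty.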

The act of measurement influences the state of the system. If we measure
$s_x = +\frac{1}{2}$ and then measure it again immediately afterward, we will get the same
value with certainty. Stated differently, doing the measurement somehow forces
the system into the eigenstate $|\chi_+\rangle$, and once it is there, in the absence of further
interactions, it stays there. This strange property of measurement, in which the
wavefunction collapses onto the observed eigenstate, was originally added to the
theory in an ad hoc manner, and is called the projection postulate. This postulate
introduces a rather arbitrary element into the theory that appears to be inconsistent:
the system evolves under quantum mechanics according to the Schrödinger
equation until a measurement is made, at which point some kind of magic associated
with the classical measurement apparatus takes place, which lies completely
outside the rest of the theory.

To understand the measurement process better it is necessary to discuss the 
coupling of a quantum system and a classical measurement apparatus in more detail. 
A measurement apparatus, such as a pointer on a dial or the conditional emission 
of a light pulse, is also a quantum mechanical system. If we treat the measurement 
device quantum mechanically as well, it should be possible to regard the apparent 
"collapse" of the wavefunction as the outcome of the quantum evolution of the 
combined system of the measurement device and the original quantum system under 
study, without invoking the projection postulate. We return to this when we discuss
decoherence in Section 6.7.

Note that a measurement does not allow one to completely determine the state. 
A complete measurement of the two-qubit system yields at most two classical bits 
of information, whereas determining the full quantum state requires knowing seven 
real numbers (four complex numbers subject to a normalization condition). In this
sense one cannot just say that a quantum state "contains" much more information
than its classical counterpart. In fact, due to the non-commutativity of the observables,
with simultaneous measurements one is able to extract less information than
from the corresponding classical system. 

There are two ways to talk about quantum theory: If one insists it is a theory of a 
single system, then one has to live with the fact that it only predicts the probability 
of things to happen and as such is a retrenchment from the ideal of classical physics. 
Alternatively one may take the view that quantum theory is a theory that only 
applies to ensembles of particles. To actually measure probability distributions 
one has to make many measurements on "identically prepared" quantum systems. 
From this perspective the dimensionality of Hilbert space should be compared to
that of classical distributions defined over a classical phase space, which makes the 
difference between classical and quantum theories far less dramatic. This raises 
the quest for a theory underlying quantum mechanics which applies for a single 
system. So far nobody has succeeded in producing such a theory, and on the 
contrary, attempts to build such theories based on "hidden variables" have failed. 
The Bell inequalities suggest that such a theory is probably impossible [56].

6.5. Multi qubit states and entanglement. When we have more than one qubit 
an important practical question is when and how measurements of a given qubit 
depend on measurements of other qubits. Because of the deep properties of quan- 
tum mechanics, qubits can be coupled in subtle ways that produce consequences 
for measurement that are very different from classical bits. Understanding this has 
proved to be important for the problems of computation and information transmis- 
sion. To explain this we need to introduce the opposing concepts of separability 
and entanglement, which describe whether measurements on different qubits are 
statistically independent or statistically dependent. 

An n-qubit state is separable if it can be factored into n single-qubit states,
i.e. if it can be written as $n - 1$ tensor products of sums of qubits, with each factor
depending only on a single qubit. An example of a separable two-qubit state is

$$|\psi\rangle = \frac{1}{2}(|00\rangle + |01\rangle + |10\rangle + |11\rangle) = \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle) \otimes \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle). \tag{6.13}$$

If an n-qubit state is separable then measurements on individual qubits are sta- 
tistically independent, i.e. the probability of making a series of measurements on 
each qubit can be written as a product of probabilities of the measurements for 
each qubit. 

An n-qubit state is entangled if it is not separable. An example of an entangled 
two-qubit state is 

$$|\psi\rangle = \frac{1}{\sqrt{2}}(|00\rangle + |11\rangle), \tag{6.14}$$

which cannot be factored into a single product. For entangled states measurements 
on individual qubits depend on each other. 

We now illustrate this for the two examples above. Suppose we do an experiment 
in which we measure the spin of the first qubit and then measure the spin of the 
second qubit. For both the separable and entangled examples, there is a 50% chance 
of observing either spin up or spin down on the first measurement. Suppose it gives 
spin up. For the separable state this transforms the wave function as 

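$$\frac{1}{2}(|0\rangle + |1\rangle) \otimes (|0\rangle + |1\rangle) \to \frac{1}{\sqrt{2}}|1\rangle \otimes (|0\rangle + |1\rangle) = \frac{1}{\sqrt{2}}(|10\rangle + |11\rangle).$$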

If we now measure the spin of the second qubit, the probability of measuring spin 
up or spin down is still 50%. The first measurement has no effect on the second 
measurement. 

In contrast, suppose we do a similar experiment on the entangled state of equation
6.14 and observe spin up in the first measurement. This transforms the wave
function as

$$\frac{1}{\sqrt{2}}(|00\rangle + |11\rangle) \to |11\rangle. \tag{6.15}$$

(Note the disappearance of the factor $1/\sqrt{2}$ due to the necessity that the wave
function remains normalized.) If we now measure the spin of the second qubit we
are certain to observe spin up! Similarly, if we observe spin down in the first measurement,
we will also observe it in the second. For the entangled example above
the measurements are completely coupled: the outcome of the first determines the
second. This property of entangled states was originally pointed out by Einstein,
Podolsky and Rosen [2T], who expressed concern about the possible consequences
of this when the qubits are widely separated in space. This line of thinking did not
point out a fundamental problem with quantum mechanics as they perhaps originally
hoped, but rather led to a deeper understanding of the quantum measurement
problem and to the practical application of quantum teleportation as discussed in
Section 6.9.
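The difference between the separable state (6.13) and the entangled state (6.14) can be made concrete by simulating the first measurement as a projection followed by renormalization. In this sketch a two-qubit state is a dictionary from labels such as "10" to amplitudes; that representation and the function names are assumptions made for illustration.

```python
import math

def measure_first(state, outcome):
    """Project onto 'first qubit = outcome' ("0" or "1").

    Returns the probability of that outcome and the renormalized post-measurement state."""
    kept = {bits: amp for bits, amp in state.items() if bits[0] == outcome}
    prob = sum(abs(a) ** 2 for a in kept.values())
    post = {bits: amp / math.sqrt(prob) for bits, amp in kept.items()}
    return prob, post

def prob_second(state, outcome):
    """Probability that the second qubit is found in 'outcome'."""
    return sum(abs(a) ** 2 for bits, a in state.items() if bits[1] == outcome)

separable = {"00": 0.5, "01": 0.5, "10": 0.5, "11": 0.5}         # state (6.13)
entangled = {"00": 1 / math.sqrt(2), "11": 1 / math.sqrt(2)}     # state (6.14)

# Suppose the first measurement gives spin up ("1") in both experiments.
p_sep, post_sep = measure_first(separable, "1")
p_ent, post_ent = measure_first(entangled, "1")

second_up_sep = prob_second(post_sep, "1")   # still one half
second_up_ent = prob_second(post_ent, "1")   # now certain
```

In both cases the first outcome has probability one half, but only for the entangled state does it fix the result of the second measurement.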

The degree of entanglement of a system of qubits is a reflection of their past 
history. By applying the right time evolution operator, i.e. by introducing appro- 
priate interactions, we can begin with a separable state and entangle it, or begin 
with an entangled state and separate it. Separation can be achieved, for example, 

Strictly speaking this is only true for pure states, which we define in the next section.

by applying the inverse of the operator that brought about the entanglement in
the first place - quantum dynamics is reversible. Alternatively separation can be 
achieved by transferring the entanglement to something else, such as the external 
environment. (In the latter case there will still be entanglement, but it will be between
one of the qubits and the environment, rather than between the two original
qubits.)

6.6. Entanglement and entropy. So far we have assumed that we are able to 
study a single particle or a few particles with perfect knowledge of the state. This is 
called a statistically pure state, or often more simply, a pure state. In experiments 
it can be difficult to prepare a system in a pure state. More typically there is an 
ensemble of particles that might be in different states, or we might have incomplete 
knowledge of the states. Such a situation, in which there is a nonzero probability 
for the particle to be in more than one state, is called a mixed state. As we explain 
below, von Neumann developed an alternative formalism for quantum mechanics 
in terms of what is called a density matrix, which replaces the wavefunction as 
the elementary level of description. The density matrix representation very simply 
handles mixed states, and leads to a natural way to measure the entropy of a 
quantum mechanical system and measure entanglement. 

Consider a mixed state in which there is a probability $p_i$ for the system to have
wavefunction $\psi_i$ and an observable characterized by operator A. The average value
measured for the observable (also called its expectation value) is

$$\langle A \rangle = \sum_i p_i \langle\psi_i|A|\psi_i\rangle. \tag{6.16}$$

We can expand each wavefunction $\psi_i$ in terms of a basis $|\chi_j\rangle$ in the form

$$|\psi_i\rangle = \sum_j \langle\chi_j|\psi_i\rangle |\chi_j\rangle,$$

where in our earlier notation $\langle\chi_j|\psi_i\rangle = \alpha_j^{(i)}$. Performing this expansion for the
dual vector $\langle\psi_i|$ as well, substituting into (6.16) and interchanging the order of
summation gives

$$\langle A \rangle = \sum_{j,k} \Big( \sum_i p_i \langle\chi_j|\psi_i\rangle \langle\psi_i|\chi_k\rangle \Big) \langle\chi_k|A|\chi_j\rangle = \sum_{j,k} \langle\chi_j|\rho|\chi_k\rangle \langle\chi_k|A|\chi_j\rangle = \mathrm{tr}(\rho A),$$

where

$$\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i| \tag{6.17}$$

is called the density matrix. Because the trace $\mathrm{tr}(\rho A)$ is independent of the
representation this can be evaluated in any convenient basis, and so provides an
easy way to compute expectations. Note that $\mathrm{tr}(\rho) = 1$. For a pure state $p_i = 1$

25 The density matrix provides an alternative representation for quantum mechanics: the
Schrödinger equation can be rewritten in terms of the density matrix so that we never need to
use wavefunctions at all.






for some value of i and $p_i = 0$ otherwise. In this case the density matrix has rank
one. This is obvious if we write it in a basis in which it is diagonal: there will only
be one nonzero element. When there is more than one nonzero value of $p_i$ it is a
mixed state and the rank is greater than one.

To get a better feel for how this works, consider the very simple example of a
single qubit, and let $\psi_1 = |1\rangle$. If this is a pure state then the density matrix is just

$$\rho = |1\rangle\langle 1| = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.$$

The expectation of the spin along the z-axis is $\mathrm{tr}(\rho s_z) = 1/2$. If, however, the
system is in a mixed state with 50% of the population spin up and 50% spin down,
this becomes

$$\rho = \frac{1}{2}(|0\rangle\langle 0| + |1\rangle\langle 1|) = \frac{1}{2} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

In this case the expectation of the spin along the z-axis is $\mathrm{tr}(\rho s_z) = 0$.

This led von Neumann to define the entropy of a quantum state in analogy with 
the Gibbs entropy for a classical ensemble as 

$$S(\rho) = -\mathrm{tr}\, \rho \log \rho = -\sum_i p_i \log p_i. \tag{6.18}$$

The entropy of a quantum state provides a quantitative measure of "how mixed" 
a system is. The entropy of a pure state is equal to zero, whereas the entropy of a 
mixed state is greater than zero. 
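The single-qubit examples above can be checked with a short Python sketch that builds the density matrix (6.17) from an ensemble, evaluates $\mathrm{tr}(\rho s_z)$, and computes the entropy (6.18) in bits. The function names are hypothetical, and the entropy routine relies on the closed-form eigenvalues of a 2 x 2 hermitian matrix, so it is limited to a single qubit.

```python
import math

def outer(psi):
    """Projector |psi><psi| as a nested list."""
    return [[a * complex(b).conjugate() for b in psi] for a in psi]

def density(ensemble):
    """rho = sum_i p_i |psi_i><psi_i| for a list of (p_i, psi_i) pairs."""
    rho = [[0, 0], [0, 0]]
    for p, psi in ensemble:
        proj = outer(psi)
        for i in range(2):
            for j in range(2):
                rho[i][j] += p * proj[i][j]
    return rho

def trace_product(a, b):
    """tr(AB) for 2x2 matrices."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))

def entropy_bits(rho):
    """Von Neumann entropy in bits, from the two eigenvalues of rho."""
    tr = (rho[0][0] + rho[1][1]).real
    det = (rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]).real
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    eigenvalues = ((tr + disc) / 2, (tr - disc) / 2)
    return -sum(p * math.log2(p) for p in eigenvalues if p > 1e-12)

S_Z = [[0.5, 0], [0, -0.5]]          # s_z in the basis ordered (|1>, |0>)
R = 1 / math.sqrt(2)

rho_pure = density([(1.0, [1, 0])])                      # pure state |1>
rho_mixed = density([(0.5, [1, 0]), (0.5, [0, 1])])      # 50/50 mixture
rho_superposition = density([(1.0, [R, R])])             # pure superposition
```

The mixture has one bit of entropy, while both pure states have zero entropy. The superposition is included to emphasize that a coherent superposition of spin up and spin down is not the same as a 50/50 mixture.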

In some situations there is a close relationship between entangled and mixed 
states. An entangled but pure state in a high dimensional multi-qubit space can 
appear to be a mixed state when viewed from the point of view of a lower dimen- 
sional state space. The view of the wavefunction from a lower dimensional subspace 
is formally taken using a partial trace. This is done by summing over all the co- 
ordinates associated with the subspaces we want to ignore. This corresponds to 
leaving some subsystems out of consideration, for example, because we can only 
measure a certain qubit and can't measure the qubits on which we perform the 



partial trace. As an example consider the entangled state of equation (6.14), and
trace it with respect to the second qubit. To do this we make use of the fact
that $\mathrm{tr}(|\psi\rangle\langle\phi|) = \langle\phi|\psi\rangle$. Using labels A and B to keep the qubits straight, and
remembering that because we are using orthogonal coordinates terms of the form
$\langle 0|1\rangle = 0$, the calculation can be written

$$\begin{aligned}
\mathrm{tr}_B(|\psi_{AB}\rangle\langle\psi_{AB}|) &= \tfrac{1}{2}\,\mathrm{tr}_B\big[(|1\rangle_A|1\rangle_B + |0\rangle_A|0\rangle_B)(\langle 0|_B\langle 0|_A + \langle 1|_B\langle 1|_A)\big] \\
&= \tfrac{1}{2}\big(|1\rangle_A\langle 1|_A \langle 1|1\rangle_B + |0\rangle_A\langle 0|_A \langle 0|0\rangle_B\big) \\
&= \tfrac{1}{2}\big(|1\rangle_A\langle 1|_A + |0\rangle_A\langle 0|_A\big).
\end{aligned}$$

This is a mixed state with probability 1/2 to be either spin up or spin down. The
corresponding entropy is also higher: in base two $S = -\log(1/2) = 1$ bit, while
for the original pure state $S = \log 1 = 0$. In general if we begin with a statistically
pure separable state and perform a partial trace we will still have a pure state, but
if we begin with an entangled state, when we perform a partial trace we will get a
mixed state. In the former case the entropy remains zero, but in the latter case it
increases. Thus the von Neumann entropy yields a useful measure of entanglement.
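The partial trace and the resulting entropy can be sketched as follows for two-qubit pure states. As before, states are stored as dictionaries from two-character labels to amplitudes; this representation, the function names, and the restriction to a single remaining qubit are assumptions of the example.

```python
import math

def reduced_density_matrix_A(state):
    """Trace out the second qubit (B) of a two-qubit pure state.

    Rows and columns of the result are ordered |1>, |0>."""
    order = "10"

    def amp(a, b):
        return complex(state.get(a + b, 0))

    return [[sum(amp(a, b) * amp(a2, b).conjugate() for b in order)
             for a2 in order] for a in order]

def entropy_bits(rho):
    """Von Neumann entropy in bits of a 2x2 density matrix."""
    tr = (rho[0][0] + rho[1][1]).real
    det = (rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]).real
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    eigenvalues = ((tr + disc) / 2, (tr - disc) / 2)
    return -sum(p * math.log2(p) for p in eigenvalues if p > 1e-12)

r = 1 / math.sqrt(2)
entangled = {"00": r, "11": r}                              # state (6.14)
separable = {"00": 0.5, "01": 0.5, "10": 0.5, "11": 0.5}    # state (6.13)

rho_entangled = reduced_density_matrix_A(entangled)   # one half times the identity
rho_separable = reduced_density_matrix_A(separable)   # still a rank-one projector

entropy_entangled = entropy_bits(rho_entangled)   # one bit
entropy_separable = entropy_bits(rho_separable)   # zero
```

The reduced state of the entangled pair carries one bit of entropy, whereas the reduced state of the separable pair remains pure.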

6.7. Measurement and Decoherence. In this section we return to the measurement
problem and the complications that arise if one wants to couple a classical
measurement device to a quantum system. A classical system is by definition described
in terms of macro-states, and one macro-state can easily correspond to
$10^{40}$ micro-states. A classical measurement apparatus like a Geiger counter or a
photomultiplier tube is prepared in a meta-stable state in which an interaction
with the quantum system can produce a decay into a more stable state indicating 
the outcome of the measurement. For example, imagine that we want to detect the 
presence of an electron. We can do so by creating a detector consisting of a meta- 
stable atom. If the electron passes by, its interaction with the meta-stable atom
via its electromagnetic field can cause the decay of the meta-stable atom, and we 
observe the emission of a photon. If it doesn't pass by we observe nothing. There 
are very many possible final states for the system, corresponding to different micro- 
states of the electron and the photon, but we aren't interested in that - all we want 
to know is whether or not a photon was emitted. Thus we have to sum over all pos- 
sible combined photon-electron configurations. This amounts to tracing the density 
matrix of the complete system consisting of the electron and the measurement ap- 
paratus over all states in which a photon is present in the final state. This leads to 
a reduced density matrix describing the electron after the measurement, with the 
electron in a mixed state, corresponding to the many possible photon states. Thus 
even though we started with a zero entropy pure state in the combined system of 
the electron and photon, we end up with a positive entropy mixed state in the space 
of the electron alone. The state of the electron is reduced to a classical probability 
distribution, and due to the huge number of microstates that are averaged over, 
the process of measurement is thermodynamically irreversible. Even if we do not 
observe the outcoming photon with our own eyes, it is clear whether or not the 
metastable atom decayed, and thus whether or not the electron passed by. 

The description of the measurement process above is an example of decoherence, 
i.e. of a process whereby quantum mechanical systems come to behave as if they 
were governed by classical probabilities. A common way for this to happen is for 
a quantum system to interact with its environment, or for that matter any other 
quantum system, in such a way that the reduced density matrix for the system of 
interest becomes diagonal in a particular basis. The phases are randomized, so that 
after the measurement the system is found to be in a mixed state. According to 
this view, the wavefunction does not actually collapse, there is just the appearance 
of a collapse due to quantum decoherence. The details of how this happens remain
controversial, and are a subject of active research [7d1 [771 153 EE]. In Section 6.11
we will give an example of how decoherence can be generated even by interactions 
between simple systems. 

6.8. The no-cloning theorem. We have seen that by doing a measurement we 
may destroy the original state. One important consequence connected to this de- 
structive property of the act of measurement is that a quantum state cannot be 
cloned; one may be able to transfer a state from one register to another but one 
cannot make a xerox copy of a given quantum state. This is expressed by the no-cloning
theorem [7U |3D]. Worded differently, the no-cloning theorem states that
for an arbitrary state $|\psi_1\rangle$ on one qubit and some particular state $|\phi\rangle$ on another,
there is no quantum device A that transforms $|\psi_1\rangle \otimes |\phi\rangle \to |\psi_1\rangle \otimes |\psi_1\rangle$, i.e. that
transforms $|\phi\rangle$ into $|\psi_1\rangle$. Letting $U_A$ be the unitary operator representing A, this
can be rewritten $|\psi_1\rangle|\psi_1\rangle = U_A|\psi_1\rangle|\phi\rangle$. For a true cloning device this property has
to hold for any other state $|\psi_2\rangle$, i.e. we must also have $|\psi_2\rangle|\psi_2\rangle = U_A|\psi_2\rangle|\phi\rangle$. We
now show the existence of such a device leads to a contradiction. Since $\langle\phi|\phi\rangle = 1$
and $U_A^\dagger U_A = 1$, and since the order in which the two qubits are written is immaterial,
the existence of a device that can clone both $\psi_1$ and $\psi_2$ would imply that

$$\begin{aligned}
\langle\psi_1|\psi_2\rangle &= (\langle\psi_1|\langle\phi|)(|\psi_2\rangle|\phi\rangle) = (\langle\psi_1|\langle\phi|U_A^\dagger)(U_A|\psi_2\rangle|\phi\rangle) = (\langle\psi_1|\langle\psi_1|)(|\psi_2\rangle|\psi_2\rangle) \\
&= \langle\psi_1|\psi_2\rangle^2.
\end{aligned}$$

The property $\langle\psi_1|\psi_2\rangle = \langle\psi_1|\psi_2\rangle^2$ only holds if $\psi_1$ and $\psi_2$ are either orthogonal
or equal, i.e. it does not hold for arbitrary values of $\psi_1$ and $\psi_2$, so there can
be no such general purpose cloning device. In fact, in view of the uncertainty of 
quantum measurements, the no-cloning theorem does not come as a surprise: If 
it were possible to clone wave functions, it would be possible to circumvent the 
uncertainty of quantum measurements by making a very large number of copies of 
a wavefunction, measuring different properties of each copy, and reconstructing the 
exact state of the original wavefunction. 
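The condition that cloning can only work for states that are orthogonal or equal can be illustrated with a concrete candidate device. The sketch below uses the controlled-NOT gate, which is not introduced in the text and is chosen here purely as an example of a unitary that does copy the two basis states onto a blank qubit prepared in $|0\rangle$. The dictionary representation and all function names are assumptions.

```python
import math

def tensor(u, v):
    """Two-qubit state from single-qubit states given as {"0": amp, "1": amp}."""
    return {a + b: u[a] * v[b] for a in "01" for b in "01"}

def cnot(state):
    """Controlled-NOT: flip the second qubit when the first one is "1"."""
    flip = {"0": "1", "1": "0"}
    return {bits[0] + (flip[bits[1]] if bits[0] == "1" else bits[1]): amp
            for bits, amp in state.items()}

def overlap(s, t):
    """Inner product <s|t> of two states stored as dictionaries with equal keys."""
    return sum(complex(s[k]).conjugate() * t[k] for k in s)

def cloning_fidelity(psi):
    """|<psi,psi| CNOT |psi,0>|^2, which equals 1 only if CNOT really copies psi."""
    blank = {"0": 1, "1": 0}
    return abs(overlap(tensor(psi, psi), cnot(tensor(psi, blank)))) ** 2

r = 1 / math.sqrt(2)
zero = {"0": 1, "1": 0}
one = {"0": 0, "1": 1}
plus = {"0": r, "1": r}      # neither orthogonal nor equal to |0>

# <psi_1|psi_2> for psi_1 = |0>, psi_2 = |plus>: equals 1/sqrt(2), so x != x**2.
x = overlap(zero, plus)
```

The two orthogonal basis states are copied perfectly, but for the superposition the device produces the entangled state $(|00\rangle + |11\rangle)/\sqrt{2}$ instead of two independent copies, and the fidelity with a true clone is only one half.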

6.9. Quantum teleportation. Quantum teleportation provides a method for pri- 
vately sending messages in a way that ensures that the receiver will know if anyone 
eavesdrops. This is possible because a quantum state is literally teleported, in the 
sense of Star Trek: A quantum state is destroyed in one place and recreated in an- 
other. Because of the no-cloning theorem, it is impossible to make more than one 
copy of this quantum state, and as a result when the new teleported state appears, 
the original state must be destroyed. Furthermore, it is impossible for both the 
intended receiver and an eavesdropper to have the state at the same time, which 
helps make the communication secure. 

Quantum teleportation takes advantage of the correlation between entangled 
states as discussed in Section 6.5. Suppose Alice wants to send a secure message
to Charlie at a (possibly distant) location. The process of teleportation depends 
on Alice and Charlie sharing different qubits of an entangled state. Alice makes 
a measurement of her part of the entangled state, which is coupled to the state 
she wants to teleport to Charlie, and sends him some classical information about 
the entangled state. With the classical information plus his half of the entangled 
state, Charlie can reconstruct the teleported state. We have indicated the process 
in figure 7. We follow the method proposed by Bennett et al. [10], and first realized
in an experimental setup by the group of Zeilinger in 1997 [12]. In realistic cases the
needed qubit states are typically implemented as left and right handed polarized 
light quanta (i.e. photons). 

The simplest example of quantum teleportation can be implemented with three 
qubits. The (A) qubit is the unknown state to be teleported, 

$$|\psi_A\rangle = \alpha|1\rangle + \beta|0\rangle . \qquad (6.19)$$

This state is literally teleported from one place to another. If Charlie likes, once 
he has the teleported state he can make a quantum measurement and extract the 
same information about $\alpha$ and $\beta$ that he would have been able to extract had he
made the measurement on the original state. 

The teleportation of this state is enabled by an auxiliary two-qubit entangled 
state. We label these two qubits B and C. For technical reasons it is convenient to
represent this in a special basis consisting of four states, called Bell states, which
are written

$$|\Psi^{(\pm)}_{BC}\rangle = \sqrt{\tfrac{1}{2}}\,\left(|1_B\rangle|0_C\rangle \pm |0_B\rangle|1_C\rangle\right)$$

$$|\Phi^{(\pm)}_{BC}\rangle = \sqrt{\tfrac{1}{2}}\,\left(|1_B\rangle|1_C\rangle \pm |0_B\rangle|0_C\rangle\right) . \qquad (6.20)$$

Figure 7. Quantum teleportation of a quantum state as proposed
by Bennett et al. [10], using an entangled pair. An explanation is
given in the text. (Diagram labels: Original, Entangled pair, Teleported replica.)

The process of teleportation can be outlined as follows (please refer to Figure 7):

(1) Someone prepares an entangled two qubit state BC (the Entangled pair in 
the diagram). 

(2) Qubit B is sent to Alice and qubit C is sent to Charlie. 

(3) In the Scanning step, Alice measures in the Bell states basis the combined
wavefunction of qubit A (the Original in the diagram) and her entangled
qubit B, leaving behind the Disrupted original.

(4) Alice sends two bits of classical data to Charlie telling him the outcome of 
her measurements (Send classical data). 

(5) Based on the classical information received from Alice, Charlie applies one
of four possible operators to qubit C (Apply treatment), and thereby
reconstructs A, getting a teleported replica of the original. If he likes, he can now
make a measurement on the replica to recover the message Alice has sent him.

We now explain this process in more detail. In step (1) an entangled two qubit
state $\psi_{BC}$ such as that of (6.14) is prepared. In step (2) qubit B is transmitted
to Alice and qubit C is transmitted to Charlie. This can be done, for example, by
sending two entangled photons, one to each of them. In step (3) Alice measures
the joint state of qubit A and B in the Bell states basis, getting two classical bits
of information, and projecting the joint wavefunction $\psi_{AB}$ onto one of the Bell
states. The Bell states basis has the nice property that the four possible outcomes
of the measurement have equal probability. To see how this works, for convenience
suppose the entangled state BC was prepared in state $|\Psi^{(-)}_{BC}\rangle$. In this case the
combined wavefunction of the three qubit state is

$$|\psi_{ABC}\rangle = |\psi_A\rangle|\Psi^{(-)}_{BC}\rangle \qquad (6.21)$$

$$= \frac{\alpha}{\sqrt{2}}\left(|1_A\rangle|1_B\rangle|0_C\rangle - |1_A\rangle|0_B\rangle|1_C\rangle\right) + \frac{\beta}{\sqrt{2}}\left(|0_A\rangle|1_B\rangle|0_C\rangle - |0_A\rangle|0_B\rangle|1_C\rangle\right) .$$

If this is expanded in the Bell states basis for the pair AB, it can be written in the 
form 

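$$|\psi_{ABC}\rangle = \frac{1}{2}\Big[\,|\Psi^{(-)}_{AB}\rangle\left(-\alpha|1_C\rangle - \beta|0_C\rangle\right) + |\Psi^{(+)}_{AB}\rangle\left(-\alpha|1_C\rangle + \beta|0_C\rangle\right)$$

$$\qquad\qquad + |\Phi^{(-)}_{AB}\rangle\left(\beta|1_C\rangle + \alpha|0_C\rangle\right) + |\Phi^{(+)}_{AB}\rangle\left(-\beta|1_C\rangle + \alpha|0_C\rangle\right)\Big] . \qquad (6.22)$$

We see that the two-qubit state AB has equal probability to be in the four possible states
$|\Psi^{(-)}_{AB}\rangle$, $|\Psi^{(+)}_{AB}\rangle$, $|\Phi^{(-)}_{AB}\rangle$ and $|\Phi^{(+)}_{AB}\rangle$.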

In step (4), Alice transmits two classical bits to Charlie, telling him which of the 
four basis functions she observed. Charlie now makes use of the fact that in the 
Bell basis there are four possible states for the entangled qubit that he has, and his 
qubit C was entangled with Alice's qubit B before she made the measurement. In 



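particular, let $|\phi_C\rangle$ be the state of the C qubit, which from (6.22) is one of the four
states:

$$|\phi_C\rangle = \begin{pmatrix} -\alpha \\ -\beta \end{pmatrix} ;\ \begin{pmatrix} -\alpha \\ \beta \end{pmatrix} ;\ \begin{pmatrix} \beta \\ \alpha \end{pmatrix} ;\ \text{and}\ \begin{pmatrix} -\beta \\ \alpha \end{pmatrix} . \qquad (6.23)$$

In step (5), based on the information that he receives from Alice, Charlie selects
one of four possible operators $F_i$ and applies it to the C qubit. There is one
operator $F_i$ for each of the four possible Bell states, which are respectively:

$$F_i = -\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} ;\ \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix} ;\ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} ;\ \text{and}\ \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} . \qquad (6.24)$$

Providing Charlie has the correct classical information and an intact entangled state
he can reconstruct the original A qubit by acting on $|\phi_C\rangle$ with the appropriate
operator $F_i$:

$$|\psi_A\rangle = \alpha|1\rangle + \beta|0\rangle = F_i|\phi_C\rangle . \qquad (6.25)$$

By simply multiplying each of the four possibilities it is easy to verify that as long
as his information is correct, he will correctly reconstruct the A qubit $\alpha|1_A\rangle + \beta|0_A\rangle$.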
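
The five steps can be traced in a short simulation. This is a sketch under stated assumptions rather than a description of any experiment: states are stored as dictionaries keyed by bit labels, column vectors are ordered (|1>, |0>) to match (6.19), and the names `BELL`, `F` and `teleport` are hypothetical. The correction matrices are the ones that map each of the four states in (6.23) back to (alpha, beta).

```python
import math

S = 1 / math.sqrt(2)

# Bell states of eq. (6.20), stored as {(bit, bit): amplitude}.
BELL = {
    "Psi-": {(1, 0): S, (0, 1): -S},
    "Psi+": {(1, 0): S, (0, 1): S},
    "Phi-": {(1, 1): S, (0, 0): -S},
    "Phi+": {(1, 1): S, (0, 0): S},
}

# Charlie's corrections as 2x2 matrices acting on column vectors ordered (|1>, |0>).
F = {
    "Psi-": [[-1, 0], [0, -1]],
    "Psi+": [[-1, 0], [0, 1]],
    "Phi-": [[0, 1], [1, 0]],
    "Phi+": [[0, 1], [-1, 0]],
}

def teleport(alpha, beta):
    """Return {Bell outcome: (probability, Charlie's corrected state)}."""
    psi_a = {1: alpha, 0: beta}
    # Three-qubit amplitudes of |psi_A>|Psi^(-)_BC>, as in eq. (6.21).
    amp = {(a, b, c): psi_a[a] * x
           for a in (0, 1) for (b, c), x in BELL["Psi-"].items()}
    results = {}
    for name, bell in BELL.items():
        # Alice's Bell measurement: project qubits A and B onto this Bell state.
        # What is left over is the (unnormalized) state of Charlie's qubit C.
        chi = [sum(x.conjugate() * amp.get((a, b, c), 0)
                   for (a, b), x in bell.items())
               for c in (1, 0)]
        prob = sum(abs(v) ** 2 for v in chi)
        phi_c = [v / math.sqrt(prob) for v in chi]
        # Charlie applies the operator selected by Alice's two classical bits.
        fixed = [sum(F[name][r][k] * phi_c[k] for k in range(2)) for r in range(2)]
        results[name] = (prob, fixed)
    return results

alpha, beta = 0.6, 0.8j
outcomes = teleport(alpha, beta)
for name, (prob, state) in outcomes.items():
    print(name, round(prob, 3), state)
```

Each of the four outcomes should occur with probability 1/4 and leave Charlie, after the correction, with the amplitudes (alpha, beta) of the original state.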

We stress that Charlie needs the classical measurement information from Alice. 
If he could do without it the teleportation process would violate causality, since 
information could be transferred instantaneously from Alice to Charlie. That is, 
when Alice measures the B qubit, naively it might seem that because the B and 
C qubits are entangled, this instantaneously collapses the C qubit, sending Charlie 
the information about Alice's measurement, no matter how far away he is. To 
understand why such instantaneous communication is not possible, suppose Charlie 
just randomly guesses the outcome and randomly selects one of the four operators 
$F_i$. Then the original state will be reconstructed as a random mixture of the four
possible incoming states $|\phi_C\rangle$. This mixture does not give any information about
the original state $|\psi_A\rangle$.
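
That the mixture carries no information can be made concrete with a density matrix. The sketch below is illustrative only (the function names are our own): it averages the projectors onto the four candidate states of (6.23) with equal weights 1/4 and shows that the result is half the identity matrix, whatever the values of alpha and beta.

```python
def outer(v):
    """Projector |v><v| for a two-component state vector."""
    return [[v[r] * v[c].conjugate() for c in range(2)] for r in range(2)]

def charlie_state_without_message(alpha, beta):
    """Equal-weight mixture of the four candidate states of qubit C in eq. (6.23)."""
    alpha, beta = complex(alpha), complex(beta)
    candidates = [(-alpha, -beta), (-alpha, beta), (beta, alpha), (-beta, alpha)]
    rho = [[0j, 0j], [0j, 0j]]
    for v in candidates:
        projector = outer(v)
        for r in range(2):
            for c in range(2):
                rho[r][c] += 0.25 * projector[r][c]
    return rho

# Two very different originals give the same density matrix for Charlie.
rho_1 = charlie_state_without_message(1.0, 0.0)
rho_2 = charlie_state_without_message(0.6, 0.8j)
print(rho_1)
print(rho_2)
```

Since the result does not depend on alpha and beta, nothing Charlie measures on qubit C alone can reveal the original state or Alice's outcome.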

The same reasoning also applies to a possible eavesdropper, conveniently named 
Eve. If she manages to intercept qubit (C) and measures it before Charlie does, 
without the two bits of classical information she will not be able to recover the 
original state. Furthermore she will have affected that state. If Charlie somehow 
gets the mutilated state he will not be able to reconstruct the original state A.
Security can be achieved if Alice first sends a sequence of known states which
can be checked by Charlie after reconstruction. If the original and reconstructed
sequences are perfectly correlated then that guarantees that Eve is not interfering.
Note that the no-cloning theorem is satisfied, since when Alice makes her measurement
she alters the state $\psi_A$ as well as her qubit B. Once she has done that, the only
hope to reconstruct the original $\psi_A$ is for her to send her measurement to Charlie,
who can apply the appropriate operator to his entangled qubit C.

The quantum security mechanism of teleportation is based on strongly corre- 
lated, highly non-local entangled states. While a strength, the non-locality of the 
correlations is also a weakness. Quantum correlations are extremely fragile and can 
be corrupted by random interactions with the environment, i.e. by decoherence. 
As we discussed before, this is a process in which the quantum correlations are 
destroyed and information gets lost. The problem of decoherence is the main stum- 
bling block in making progress towards large scale development and application of 
quantum technologies. Nevertheless, in 2006 the research group of Gisin at the 
University of Geneva succeeded in demonstrating teleportation over a distance of 
550 meters using the optical fiber network of Swisscom [36] . 

6.10. Quantum computation. Quantum computation is performed by setting up 
controlled interactions with non-trivial dynamics that successively couple individual 
qubits together and alter the time evolution of the wavefunction in a predetermined 
manner. A multi-qubit system is first prepared in a known initial state, representing 
the input to the program. Then interactions are switched on by applying forces, 
such as magnetic fields, that determine the direction in which the wavefunction 
rotates in its state space. Thus a quantum program is just a sequence of unitary 
operations that are externally applied to the initial state. This is achieved in 
practice by a corresponding sequence of quantum gates. When the computation is 
done, measurements are made to read out the final state.

Quantum computation is essentially a form of analog computation. A physical 
system is used to simulate a mathematical problem, taking advantage of the fact 
that they both obey the same equations. The mathematical problem is mapped 
onto the physical system by finding an appropriate arrangement of magnets or other 
fields that will generate the proper equation of motion. One then prepares the ini- 
tial state, lets the system evolve, and reads out the answer. Analog computers 
are nothing new. For example, Leibniz built a mechanical calculator for perform-
ing multiplication in 1694, and in the middle of the twentieth century, because of 
their vastly superior speed in comparison with digital computers, electronic analog 
computers were often used to solve differential equations. 

Then why is quantum computation special? The key to its exceptional power is 
the massive parallelism at intermediate stages of the computation. Any operation 
on a given state works exactly the same on all basis vectors. The physical process 
that defines the quantum computation for an n qubit system thus acts in parallel 
on a set of $2^n$ complex numbers, and the phases of these numbers (which would not
exist in a classical computation) are important in determining the time evolution 
of the state. When the measurement is made to read out the answer at the end 
of the computation we are left with the n-bit output and the phase information is 
lost. 
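
To make the counting concrete, the sketch below tracks an n-qubit state classically as a list of 2^n complex amplitudes and applies a one-qubit gate to each qubit in turn. It is an illustration only: the function name is hypothetical, and the matrix index is taken equal to the bit value, which is a bookkeeping assumption of this sketch rather than the column ordering used in the equations of this section.

```python
import math

def apply_one_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector with 2**n amplitudes."""
    new_state = [0j] * len(state)
    mask = 1 << target
    for index, amplitude in enumerate(state):
        in_bit = (index >> target) & 1
        for out_bit in (0, 1):
            # Same basis label, except that the target bit is set to out_bit.
            out_index = (index & ~mask) | (out_bit << target)
            new_state[out_index] += gate[out_bit][in_bit] * amplitude
    return new_state

S = 1 / math.sqrt(2)
HADAMARD = [[S, S], [S, -S]]

n = 10
state = [0j] * (2 ** n)
state[0] = 1.0  # every qubit starts in the basis state labelled 0

for qubit in range(n):
    state = apply_one_qubit_gate(state, HADAMARD, qubit)

print(len(state))          # number of amplitudes a classical simulation must store
print(abs(state[0]) ** 2)  # probability of one particular n-bit readout
```

Each gate application touches all 2^n amplitudes in the classical simulation, whereas the physical device performs it as a single operation; the final readout still yields only n classical bits.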

Because quantum measurements are generically probabilistic, it is possible for
the 'same' computation to yield different "answers", e.g. because the measurement
process projects the system onto different eigenstates. This can create a need
for error correction mechanisms, though for some problems, such as factoring large
numbers, it is possible to test for correctness by simply checking the answer to be 
sure it works. It is also possible for quantum computers to make mistakes due to 
decoherence, i.e. because of essentially random interactions between the quantum 
state used to perform the computation and the environment. This also necessitates 
error correction mechanisms. 

The problems caused by decoherence are perhaps the central difficulty in creating
physical implementations of quantum computation. These can potentially be
overcome by constructing systems where the quantum state is not encoded locally, 
but rather globally, in terms of topological properties of the system that cannot 
be disrupted by external (local) noise. This is called topological quantum comput- 
ing. This interesting possibility arises in certain two-dimensional physical media 
which exhibit topological order, referring to states of matter in which the essential 
quantum degrees of freedom and their interactions are topological |42[ 116] . 

6.11. Quantum gates and circuits. In the same way that classical gates are the 
building blocks of classical computers, quantum gates are the basic building blocks 
of quantum computers. A gate used for a classical computation implements binary 
operations on binary inputs, changing zeros into ones and vice versa. For example, 
the only nontrivial single bit logic operation is NOT, which takes 0 to 1 and 1 to 0.
In a quantum computation the situation is quite different, because qubits can exist
in superpositions of 0 and 1. The set of allowable single qubit operations consists
of unitary transformations corresponding to 2 x 2 complex matrices U such that
$U^\dagger U = 1$. The corresponding action on a single qubit is represented in a circuit
as illustrated in figure 8. Some quantum gates have classical analogues, but many
do not. For example, the operator $X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ is the quantum equivalent of the
classical NOT gate, and serves the function of interchanging spin up and spin down.
In contrast, the operation $\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$ rotates the phase of the wavefunction by 180
degrees and has no classical equivalent.

Figure 8. The diagram representing the action of a unitary matrix U
corresponding to a quantum gate on the state of a qubit.

A general purpose quantum computer has to be able to transform an arbitrary n- 
qubit input into an n-qubit output corresponding to the result of the computation. 
In principle implementing such a computation might be extremely complicated, and 
might require constructing quantum gates of arbitrary order and complexity. 

Fortunately, it is possible to prove that the transformations needed to implement
a universal quantum computer can be generated by a simple so-called universal
set of elementary quantum gates, for example involving a well chosen pair of a
one-qubit and a two-qubit gate. Single qubit gates are unitary matrices with three
real degrees of freedom. If we allow ourselves to work with finite precision, the set
of all gates can be arbitrarily well approximated by a small well chosen set. There
are many possibilities - the optimal choice depends on the physical implementation
of the qubits. Typical one-qubit logical gates are for example the following:



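$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \qquad (6.26)$$

$$P(\theta) = \begin{pmatrix} 1 & 0 \\ 0 & \exp i\theta \end{pmatrix} \qquad (6.27)$$

$$H = \sqrt{\tfrac{1}{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \qquad (6.28)$$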



X is the quantum equivalent of the classical NOT gate, serving the function of
interchanging $|1\rangle$ and $|0\rangle$. The two other ones have no classical equivalent. The
$P(\theta)$ operation is the phase gate: it changes the relative phase by an angle
$\theta$, typically with $\theta$ an irrational multiple of $\pi$. For the third gate we can
choose the so-called Hadamard gate H which creates a superposition of the basis
states: $|1\rangle \Rightarrow \sqrt{\tfrac{1}{2}}\,(|1\rangle + |0\rangle)$.
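
These properties are easy to verify by direct matrix multiplication. The following sketch assumes column vectors ordered as (|1>, |0>), consistent with the Hadamard action quoted above; the helper names (`dagger`, `matmul`, `is_unitary`, `apply`) are our own and purely illustrative.

```python
import cmath
import math

def dagger(m):
    """Conjugate transpose of a 2x2 matrix."""
    return [[m[c][r].conjugate() for c in range(2)] for r in range(2)]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def is_unitary(m, tol=1e-12):
    """Check that m^dagger m is the identity."""
    product = matmul(dagger(m), m)
    identity = [[1, 0], [0, 1]]
    return all(abs(product[r][c] - identity[r][c]) < tol
               for r in range(2) for c in range(2))

def apply(m, v):
    """Act with a 2x2 matrix on a two-component state vector."""
    return [sum(m[r][k] * v[k] for k in range(2)) for r in range(2)]

S = 1 / math.sqrt(2)
X = [[0, 1], [1, 0]]       # eq. (6.26)
H = [[S, S], [S, -S]]      # eq. (6.28)

def P(theta):
    """Phase gate of eq. (6.27)."""
    return [[1, 0], [0, cmath.exp(1j * theta)]]

KET1 = [1, 0]  # assumed ordering: first component is the amplitude of |1>
KET0 = [0, 1]

print(is_unitary(X), is_unitary(P(0.7)), is_unitary(H))
print(apply(X, KET1))  # NOT: |1> goes to |0>
print(apply(H, KET1))  # equal superposition of |1> and |0>
```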

From the perspective of experimental implementation, a convenient two-qubit 
gate is the CNOT gate. It has been shown that the CNOT in combination with
one-qubit gates such as the Hadamard and phase gates forms a universal set [5].
The CNOT gate acts as follows on the state $|A\rangle \otimes |B\rangle$:

$$\text{CNOT} : |A\rangle \otimes |B\rangle \Rightarrow |A\rangle \otimes |[A + B] \bmod 2\rangle \qquad (6.29)$$

In words, the CNOT gate flips the state of B if A = 1, and does nothing if A = 0.
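In matrix form, in the basis ordered as $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ (qubit A first), one may write the CNOT gate as

$$\text{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} . \qquad (6.30)$$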



We have fully specified its action on the basis states in figure 9.

Figure 9. The circuit diagram representing the action of the
CNOT gate defined in (6.30) on the four possible two-qubit basis
states. The filled dot on the upper qubit denotes the control and
the cross is the symbol for the conditional one qubit NOT gate.

With the CNOT gate one can generate an entangled state from a separable one,
as follows:

$$\text{CNOT} : \frac{1}{\sqrt{2}}\,(|0\rangle + |1\rangle) \otimes |0\rangle \Rightarrow \frac{1}{\sqrt{2}}\,(|00\rangle + |11\rangle) . \qquad (6.31)$$
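
The rule (6.29) and the entangling step (6.31) can be reproduced in a few lines. This is a sketch with hypothetical names: two-qubit states are dictionaries keyed by the bit labels (A, B), the matrix is built in the basis ordering 00, 01, 10, 11 (an assumption of this sketch), and separability is tested with the standard criterion that a two-qubit pure state is a product state exactly when a00 a11 - a01 a10 = 0.

```python
import math

def cnot(state):
    """Apply eq. (6.29) to a two-qubit state given as {(A, B): amplitude}."""
    result = {}
    for (a, b), amplitude in state.items():
        key = (a, (a + b) % 2)  # B is flipped only when the control A is 1
        result[key] = result.get(key, 0) + amplitude
    return result

def is_separable(state, tol=1e-12):
    """Product-state test for a two-qubit pure state: a00*a11 - a01*a10 == 0."""
    def amp(i, j):
        return state.get((i, j), 0)
    return abs(amp(0, 0) * amp(1, 1) - amp(0, 1) * amp(1, 0)) < tol

def cnot_matrix():
    """CNOT as a 4x4 matrix in the (assumed) basis ordering 00, 01, 10, 11."""
    basis = [(0, 0), (0, 1), (1, 0), (1, 1)]
    matrix = [[0] * 4 for _ in basis]
    for col, label in enumerate(basis):
        (image,) = cnot({label: 1}).keys()
        matrix[basis.index(image)][col] = 1
    return matrix

S = 1 / math.sqrt(2)
before = {(0, 0): S, (1, 0): S}  # (|0> + |1>)/sqrt(2) on A, |0> on B
after = cnot(before)

print(after)
print(is_separable(before), is_separable(after))
for row in cnot_matrix():
    print(row)
```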

In fact, from an intuitive point of view the ability to generate substantial speed- 
ups using a quantum computer vs. a classical computer is related to the ability 
to operate on the high dimensional state space including the entangled states. To 
describe a separable n-qubit state with k bits of accuracy we only need to describe 
each of the individual qubits separately, which only requires the order of nk bits. In 
contrast, to describe an n-qubit entangled state we need the order of k bits for each
dimension in the Hilbert space, i.e. we need the order of $k 2^n$ bits. If we were to
simulate the evolution of an entangled state on a classical computer we would have 
to process all these bits of information and the computation would be extremely 
slow. Quantum computation, in contrast, acts on all this information at once - a 
quantum computation acting on an entangled state is just as fast as one acting on a 
separable state. Thus, if we can find situations where the evolution of an entangled 
state can be mapped into a hard mathematical problem, we can sometimes get 
substantial speedups. 
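
The difference between the two counts grows very quickly with n. The short sketch below simply evaluates the two order-of-magnitude estimates from the text, nk and k 2^n; the function names and the choice k = 16 are illustrative assumptions.

```python
def bits_separable(n, k):
    """Order of nk bits: each of the n qubits is described separately."""
    return n * k

def bits_entangled(n, k):
    """Order of k * 2**n bits: k bits for each dimension of the Hilbert space."""
    return k * 2 ** n

k = 16  # bits of accuracy per number, chosen only for illustration
for n in (10, 20, 50):
    print(n, bits_separable(n, k), bits_entangled(n, k))
```

Already for n = 50 the entangled description needs of the order of 10^16 bits, while the separable one needs fewer than a thousand.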

The CNOT gate can also be used to illustrate how decoherence comes about. 
Through the same action that allows it to generate an entangled state from a 
separable state, when viewed from the perspective of a single qubit, the resulting 



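state becomes decoherent. That is, suppose we look at (6.31) in the density matrix
representation. Looking at the first qubit only, the wavefunction of the separable
state is $|\psi\rangle = \frac{1}{\sqrt{2}}(|1\rangle + |0\rangle)$, or in the density matrix representation

$$|\psi\rangle\langle\psi| = \frac{1}{2}\left(|1\rangle\langle 1| + |1\rangle\langle 0| + |0\rangle\langle 1| + |0\rangle\langle 0|\right) = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} .$$

Under the action of CNOT this becomes $\frac{1}{2}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, i.e. it becomes diagonal.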
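
The loss of the off-diagonal terms can be seen by tracing out the second qubit. The following is a minimal sketch with hypothetical function names; two-qubit states are dictionaries keyed by bit labels (A, B), and the rows and columns of the density matrix are ordered (|1>, |0>) as in the matrices above.

```python
import math

def cnot(state):
    """Flip B when A = 1, for a state given as {(A, B): amplitude}."""
    return {(a, (a + b) % 2): amp for (a, b), amp in state.items()}

def reduced_density_matrix_of_first_qubit(state):
    """Trace out qubit B; rows and columns are ordered (|1>, |0>)."""
    order = (1, 0)
    return [[sum(state.get((r, b), 0) * complex(state.get((c, b), 0)).conjugate()
                 for b in (0, 1))
             for c in order]
            for r in order]

S = 1 / math.sqrt(2)
separable = {(1, 0): S, (0, 0): S}  # (|1> + |0>)/sqrt(2) on A, |0> on B
entangled = cnot(separable)

rho_before = reduced_density_matrix_of_first_qubit(separable)
rho_after = reduced_density_matrix_of_first_qubit(entangled)
print(rho_before)
print(rho_after)
```

Before the CNOT all four entries equal 1/2; afterwards only the diagonal survives, because the two branches of qubit A are now correlated with orthogonal states of qubit B.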

6.12. Applications. At the present point in time there are many different efforts 
in progress to implement quantum computing. In principle all that is needed is a 
simple two level quantum system that can easily be manipulated and scaled up to 
a large number of qubits. The first requirement is not very restrictive, and many 
different physical implementations of systems with a single or a few qubits have been 
achieved, including NMR, spin lattices, linear optics with single photons, quantum 
dots, Josephson junction networks, ion traps and atoms and polar molecules in 
optical lattices [15]. The much harder problem that has so far limited progress
toward practical computation is to couple the individual qubits in a controllable
way and to achieve a sufficiently low level of decoherence. With the great efforts now
taking place, future developments could be surprisingly fast. If we had quantum
computers at our disposal, what miracles would they perform? As we said in the
introduction to this section, there are many problems where the intrinsic massive
parallelism of quantum evolution might yield dramatic speedups. The point is not
that a classical computer would not be able to do the same computation - after all,
one can always simulate a quantum computer on a classical one - but rather the time
that is needed. As we mentioned already, the most spectacular speedup is the Shor
algorithm (1994) for factorizing large numbers into their prime factors [62]. Because
many security keys are based on the inability to factor large numbers into prime
factors, the reduction from an exponentially hard to a polynomially hard problem
has many practical applications for code breaking. Another important application
is the quadratic speedup by Grover's algorithm (1996) [25] for problems such as
the traveling salesman, in which large spaces need to be searched. Finally, an
important application is the simulation of quantum systems themselves [3]. Having
a quantum computer naturally provides an exponential speed-up, which in turn
feeds back directly into the development of new quantum technologies.

Footnote: A first 16-qubit quantum computer has been announced by D-Wave Systems Inc. in
California, but at the time of writing this product is not available yet.
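
To give a feeling for what a quadratic speedup means, the sketch below compares rough query counts for searching an unstructured space of N items with one marked entry. It assumes the standard estimates of about N/2 classical lookups on average and about (pi/4) sqrt(N) Grover iterations; the function names are hypothetical and the numbers are order-of-magnitude illustrations only.

```python
import math

def classical_queries(n_items):
    """Rough average number of lookups for a classical unstructured search."""
    return n_items // 2

def grover_iterations(n_items):
    """Approximate number of Grover iterations for one marked item."""
    return math.floor(math.pi / 4 * math.sqrt(n_items))

for n_items in (10 ** 6, 10 ** 12):
    print(n_items, classical_queries(n_items), grover_iterations(n_items))
```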

Quantum computation and security are another challenging instance of the sur- 
prising and important interplay between the basic concepts of physics and informa- 
tion theory. If physicists and engineers succeed in mastering quantum technologies 
it will mark an important turning point in information science. 



7. Black Holes: a space time information paradox 

In this section we make a modest excursion into the realm of curved space-time 
as described by Einstein's theory of general relativity. As was realized only in
the 1970's, this theory poses an interesting and still not fully resolved information 
paradox for fundamental physics. In general relativity gravity is understood as a 
manifestation of the curvature of space-time: the curvature of space-time deter- 
mines how matter and radiation propagate, while at the same time matter and 
radiation determine how space-time is curved. Particles follow geodesics in curved
space-time to produce the curvilinear motion that we observe. 

An unexpected and long-ignored prediction of general relativity was the exis- 
tence of mysterious objects called black holes that correspond to solutions with 
a curvature singularity at their center. Black holes can be created when a very 
massive star burns all of its nuclear fuel and subsequently collapses into an ultra- 
compact object under its own gravitational pull. The space-time curvature at the 
surface of a black hole is so strong that even light cannot escape - hence the term
"black hole". The fact that the escape velocity from a black hole is larger than the
speed of light implies, at least classically, that no information from inside the black
hole can ever reach far away observers. The physical size of a black hole of mass M
is defined by its event horizon, which is an imaginary sphere centered on the hole
with a radius (called the Schwarzschild radius)

$$R_S = \frac{2 G_N M}{c^2} , \qquad (7.1)$$

where $G_N$ is Newton's gravitational constant and c is the velocity of light. For a
black hole with the mass of the sun this yields $R_S \approx 3$ km, and for the earth only
$R_S \approx 1$ cm! The only measurable quantities of a black hole for an observer far away
are its mass, its charge and its angular momentum. 
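
The two numbers quoted for the sun and the earth follow directly from (7.1). The sketch below evaluates the formula in SI units; the constant values are rounded textbook values and the function name is our own.

```python
G_N = 6.674e-11      # Newton's constant, m^3 / (kg s^2)
C = 2.998e8          # speed of light, m / s
M_SUN = 1.989e30     # kg
M_EARTH = 5.972e24   # kg

def schwarzschild_radius(mass):
    """Eq. (7.1): R_S = 2 G_N M / c^2, in metres for a mass in kilograms."""
    return 2 * G_N * mass / C ** 2

r_sun = schwarzschild_radius(M_SUN)
r_earth = schwarzschild_radius(M_EARTH)
print(f"sun:   {r_sun / 1e3:.2f} km")
print(f"earth: {r_earth * 1e2:.2f} cm")
```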

But what about the second law of thermodynamics? If we throw an object 
with non-zero entropy into a black hole, it naively seems that the entropy would
disappear for ever and thus the total entropy of the universe would decrease, causing 
a blunt violation of the second law of thermodynamics. In the early 1970's, however, 
Bekenstein [7] and Hawking [4] showed that it is possible to assign an entropy to
a black hole. This entropy is proportional to the area $A = 4\pi (R_S)^2$ of the event
horizon,

$$S_{BH} = \frac{A c^3}{4 G_N \hbar} . \qquad (7.2)$$

A striking analogy with the laws of thermodynamics became evident: The change of
mass (or energy) as we throw things in leads according to classical general relativity
to a change of horizon area, as the Schwarzschild radius also increases. For an
electrically neutral, spherically symmetric black hole, it is possible to show that the
incremental change of mass dM of the black hole is related to the change of area
dA as

$$dM = \frac{\kappa}{2\pi}\, dA \qquad (7.3)$$

where $\kappa = \hbar c / 2R_S$ is the surface gravity at the horizon. One can make an analogy
with thermodynamics, where dA plays the role of "entropy", dM the role of "heat",
and $\kappa$ the role of "temperature". Since no energy can leave the black hole, dM is
positive and therefore dA > 0, analogous to the second law of thermodynamics. At
this point the correspondence between black hole dynamics and thermodynamics
is a mere analogy, because we know that a classical black hole does not radiate
and therefore has zero temperature. One can still argue that the information is not
necessarily lost; it is only somewhere else and irretrievable for certain observers.

What happens to this picture if we take quantum physics into account? Stephen
Hawking was the first to investigate the quantum behavior of black holes and his
results radically changed their physical interpretation. He showed [32, 33] that
if we apply quantum theory to the spacetime region close to the horizon then
black holes aren't black at all! Using arguments based on the spontaneous creation
of particle-antiparticle pairs in the strong gravitational field near the horizon he
showed that a black hole behaves like a stove, emitting black body thermal radiation
of a characteristic temperature, called the Hawking temperature, given by

$$T_H = \frac{\hbar c}{4\pi R_S} = \frac{\hbar c^3}{8\pi G_N M} , \qquad (7.4)$$

fully consistent with the first law (7.3). We see that the black hole temperature is
inversely proportional to its mass, which means that a black hole becomes hotter
and radiates more energy as it becomes lighter. In other words, a black hole will
radiate and lose mass at an ever-increasing rate until it finally explodes.
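
A few numbers help to put these formulas in perspective. The sketch below is illustrative only: it evaluates the Hawking temperature for a solar-mass black hole, checks numerically that the standard Bekenstein-Hawking entropy A c^3 / (4 G hbar) satisfies c^2 dM = T dS, and evaluates the rough lifetime scale G^2 M^3 / (hbar c^4). The constants are rounded SI values, Boltzmann's constant is set to one except when converting to kelvin, and all function names are our own.

```python
import math

HBAR = 1.054571817e-34  # J s
C = 2.99792458e8        # m / s
G_N = 6.67430e-11       # m^3 / (kg s^2)
K_B = 1.380649e-23      # J / K, used only to quote the temperature in kelvin
M_SUN = 1.989e30        # kg
YEAR = 3.156e7          # s

def hawking_temperature(mass):
    """Hawking temperature in joules (units with k = 1)."""
    return HBAR * C ** 3 / (8 * math.pi * G_N * mass)

def bh_entropy(mass):
    """Bekenstein-Hawking entropy A c^3 / (4 G hbar), dimensionless for k = 1."""
    r_s = 2 * G_N * mass / C ** 2
    area = 4 * math.pi * r_s ** 2
    return area * C ** 3 / (4 * G_N * HBAR)

def lifetime_scale(mass):
    """Order-of-magnitude evaporation time G^2 M^3 / (hbar c^4), in seconds."""
    return G_N ** 2 * mass ** 3 / (HBAR * C ** 4)

temperature_kelvin = hawking_temperature(M_SUN) / K_B

# First-law check: T dS should equal c^2 dM for a small change of mass.
d_mass = 1e-3 * M_SUN
d_entropy = bh_entropy(M_SUN + d_mass) - bh_entropy(M_SUN - d_mass)
first_law_ratio = hawking_temperature(M_SUN) * d_entropy / (2 * d_mass * C ** 2)

lifetime_years = lifetime_scale(M_SUN) / YEAR

print(f"T_H for one solar mass: {temperature_kelvin:.2e} K")
print(f"T dS / (c^2 dM): {first_law_ratio:.6f}")
print(f"lifetime scale: {lifetime_years:.1e} years")
```

Halving the mass doubles the temperature, which is the runaway behaviour described above.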

We conclude that quantum mechanics indeed radically changes the picture of 
a black hole. Black holes will eventually evaporate, presumably leaving nothing 
behind except thermal radiation, which has a nonzero entropy. However, as we 
discussed in the previous section, if we start with a physical system in a pure state 
that develops into a black hole, which subsequently evaporates, then at the level of 
quantum mechanics the information about the wavefunction should be rigorously 
preserved - the quantum mechanical entropy should not change. 

Footnote: We recall that we adopted units where Boltzmann's constant k is equal to one.

Footnote: The type of black holes that are most commonly considered are very massive objects like
collapsed stars. The lifetime of a black hole is given by $\tau \simeq G_N^2 M^3 / \hbar c^4$, which implies that the
lifetime of such a massive black hole is on the order of $\tau > 10^{50}$ years (much larger than the
lifetime of the universe $t_0 \simeq 10^{10}$ y). Theoretical physicists have also considered microscopic
black holes, where the information paradox we are discussing leads to a problem of principle.

It may be helpful to compare the complete black hole formation and evaporation
process with a similar, more familiar situation (proposed by Sidney Coleman) where
we know that quantum processes conserve entropy. Imagine a piece of coal at zero
temperature (where by definition S = 0) that gets irradiated with a given amount
of high entropy radiation, which we assume gets absorbed completely. It brings
the coal into an excited state of finite temperature. As a consequence the piece of
coal starts radiating, but since there is no more incoming radiation, it eventually
returns to the zero temperature state, with zero entropy. As the quantum process
of absorbing the initial radiation and emitting the outgoing radiation is unitary, it
follows that the outgoing radiation should have exactly the same entropy as the
incoming radiation.
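
The statement that unitary evolution leaves the entropy unchanged can be checked on the smallest possible example. The sketch below is not a model of the coal: it takes a single qubit in a mixed state, conjugates its density matrix with a unitary (here the Hadamard matrix, an arbitrary choice), and compares the von Neumann entropies. The function names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def dagger(m):
    """Conjugate transpose of a 2x2 matrix."""
    return [[m[c][r].conjugate() for c in range(2)] for r in range(2)]

def entropy(rho):
    """Von Neumann entropy -sum(p ln p) of a 2x2 density matrix, via its eigenvalues."""
    trace = (rho[0][0] + rho[1][1]).real
    det = (rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]).real
    root = math.sqrt(max(trace * trace - 4 * det, 0.0))
    eigenvalues = [(trace + root) / 2, (trace - root) / 2]
    return -sum(p * math.log(p) for p in eigenvalues if p > 1e-15)

S = 1 / math.sqrt(2)
U = [[S, S], [S, -S]]             # any unitary would do; this one mixes the basis states
rho_in = [[0.9, 0.0], [0.0, 0.1]]  # a mixed state with nonzero entropy
rho_out = matmul(matmul(U, rho_in), dagger(U))

print(entropy(rho_in), entropy(rho_out))
```

The density matrix changes from diagonal to non-diagonal, but its eigenvalues, and hence its entropy, are the same before and after.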

Thus, if we view the complete process of black hole formation and subsequent 
evaporation from a quantum mechanical point of view there should be no loss of 
information. So if the initial state is a pure state then a pure state should come
out. But how can this be compatible with the observation that only thermal radi- 
ation comes out, independent of what we throw in? Thermal radiation is produced 
by entropy generating processes, is maximally uncorrelated and random, and has 
maximal entropy. If we throw the Encyclopaedia Britannica into the black hole and
only get radiation out, its highly correlated initial state would seem to have been 
completely lost. This suggests that Hawking's quantum calculation is in some way 
incomplete. These conflicting views on the process of black hole formation and 
evaporation are referred to as the black hole information paradox. It has given rise 
to a fundamental debate in physics between the two principal theories of nature:
the theory of relativity describing space-time and gravity on one hand and the 
theory of quantum mechanics describing matter and radiation on the other. Does 
the geometry of Einstein's theory of relativity prevail over quantum theory, or vice
versa?

If quantum theory is to survive one has to explain how the incoming information 
gets transferred to the outgoing radiation coming from the horizon, so that a
clever quantum detective making extremely careful measurements with very fancy 
equipment could recover it. If such a mechanism is not operative the incoming 
information is completely lost, and the laws of quantum mechanics are violated. 
The question is, what cherished principles must be given up? 

There is a generic way to think about this problem along the lines of quan- 
tum teleportation and a so-called final state projection |35[ 148] . We mentioned 
that Hawking radiation can be considered as a consequence of virtual particle- 
antiparticle pair production near the horizon of the black hole. The pairs that are 
created and separated at the horizon are in a highly entangled state, leading to 
highly correlated in-falling and outgoing radiation. It is then possible, at least in 
principle, that the interaction between the in-falling radiation and the in-falling 
matter (making the black hole) would lead to a projection in a given quantum 
state. Knowing that final state - for example by proving that only a unique state 
is possible - one would instantaneously have teleported the information from the 
incoming mass (qubit A) to the outgoing radiation (qubit C) by using the entangled
pair (qubit pair BC) in analogy with the process of teleportation we discussed
in section 6.9. The parallel with quantum teleportation is only partial, because
in that situation the sender Alice (inside the black hole) has to send some classical
information on the outcome of her measurements to the receiver Charlie (outside
the black hole) before he is able to decode the information in the outgoing radiation.
But sending classical information out of a black hole is impossible. So this
mechanism to rescue the information from the interior can only work if there is a
projection onto an a priori known unique final state, so that it is as if Alice made a
measurement yielding this state and sent the information to Charlie. But how this
assumption could be justified is still a mystery.

Footnote: It has been speculated by a number of authors that there is the logical possibility that the
black hole does not disappear altogether, but leaves some remnant behind just in order to preserve
the information. The final state of the remnant should then somehow contain the information of
the matter thrown in.

A more ambitious way to attack this problem is to attempt to construct a quan- 
tum theory of gravity where one assumes the existence of microscopic degrees of 
freedom so that the thermodynamic properties of black holes could be explained 
by the statistical mechanics of these underlying degrees of freedom. Giving the 
quantum description of these new fundamental degrees of freedom would then al- 
low for a unitary description. Before we explain what these degrees of freedom 
might be, let us first consider another remarkable property of black holes. As we 
explained before, the entropy of systems that are not strongly coupled is an exten- 
sive property, i.e. proportional to volume. The entropy of a black hole, in contrast, 
is proportional to the area of the event horizon rather than the volume. This di- 
mensional reduction of the number of degrees of freedom is highly suggestive that 
all the physics of a black hole takes place at its horizon, an idea introduced by 't 
Hooft and Susskind [66], that is called the holographic principle.

Resolving the clash between the quantum theory of matter and general relativity 
of space-time is one of the main motivations for the great effort to search for a theory 
that overarches all of fundamental physics. At this moment the main line of attack 
is based on superstring theory, which is a quantum theory in which both matter 
and space-time are a manifestation of extremely tiny strings ($l \approx 10^{-35}$ m). This
theory incorporates microscopic degrees of freedom that might provide a statistical 
mechanical account of the entropy of black holes. In 1996 Strominger and Vafa [65]
managed to calculate the Bekenstein-Hawking entropy for (extremal) black holes in 
terms of microscopic strings using a property of string theory called duality, which 
allowed them to count the number of accessible quantum states. The answer they 
found implied that for the exterior observer information is preserved on the surface 
of the horizon, basically realizing the holographic principle. 

There are indeed situations (so-called Anti-de Sitter/Conformal Field Theory du- 
alities or AdS/CFT models) in string theory describing space-times with a boundary 
where the holographic principle is realized explicitly. One hopes that in such models 
the complete process of formation and evaporation of a black hole can be described 
by the time evolution of its holographic image on the boundary, which in this case 
is a supersymmetric gauge theory, a well-behaved quantum conformal field theory 
(CFT). A caveat is that in this particular Anti-de Sitter (AdS) classical setting so 
far only a static "eternal" black hole solution has been found, so, interesting as that 
situation may be, it does not yet allow for a decisive answer to a completely realistic 
process of black hole formation and evaporation. Nevertheless, the communis 
opinio - at least for the moment - is that the principles of quantum theory have 
successfully passed a severe test [rJ5]. 

A hologram is a two dimensional image that appears to be a three dimensional image; in a 
similar vein, a black hole is a massive object for which everything appears to take place on the 
surface. 

8. Conclusion 

The basic method of scientific investigation is to acquire information about na- 
ture by doing measurements and then to make models which optimally compress 
that information. Therefore information theoretic questions arise naturally at all 
levels of scientific enterprise: in the analysis of measurements, in performing com- 
puter simulations, and in evaluating the quality of mathematical models and theo- 
ries. 

The notion of entropy started in thermodynamics as a rather abstract math- 
ematical property. With the development of statistical mechanics it emerged as 
a measure of disorder, though the notion of disorder referred to a very restricted 
context. With the passage of time the generality and the power of the notion of en- 
tropy became clearer, so that now the line of reasoning is easily reversed - following 
Jaynes, statistical mechanics is reduced to an application of the maximum entropy 
principle, using constraints that are determined by the physical system. Forecast- 
ing is a process whose effectiveness can be understood in terms of the information 
contained in measurements, and the rate at which the geometry of the underlying 
dynamical system, used to make the forecast, causes this information to be lost. 
And following Rissanen, the whole scientific enterprise is reduced to the principle 
of minimum description length, which essentially amounts to finding the optimal 
compromise between the information contained in a model and the information 
contained in the discrepancies between the model and the data. 
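
The trade-off behind the minimum description length principle can be made concrete with a toy example. The sketch below compares two models of a binary sequence: a fair coin with no free parameters, and a biased coin with one fitted parameter. Each model is scored by a two-part code length: bits to state the model plus bits to encode the data given the model. The parameter cost of (1/2) log2 n bits per parameter is a common asymptotic approximation assumed here for illustration. Neither it nor the function names are taken from the text or from Rissanen's own formulation.

```python
import math


def data_bits(bits, p_one):
    """Code length of the data, in bits, under a Bernoulli(p_one) model."""
    ones = sum(bits)
    zeros = len(bits) - ones
    total = 0.0
    if ones:
        total -= ones * math.log2(p_one)
    if zeros:
        total -= zeros * math.log2(1.0 - p_one)
    return total


def description_length(bits, n_params, p_one):
    """Two-part code length: model cost plus cost of the data given the model.

    The model cost uses the assumed approximation of (1/2) log2(n) bits
    per fitted parameter.
    """
    model_bits = 0.5 * n_params * math.log2(len(bits))
    return model_bits + data_bits(bits, p_one)


def choose_model(bits):
    """Return (name, total bits) of the model with the shorter description."""
    fair = description_length(bits, 0, 0.5)
    fitted = description_length(bits, 1, sum(bits) / len(bits))
    if fair <= fitted:
        return ("fair coin", fair)
    return ("biased coin", fitted)


if __name__ == "__main__":
    print(choose_model([1] * 90 + [0] * 10))  # strongly biased sequence
    print(choose_model([1, 0] * 50))          # balanced sequence
```

For the strongly biased sequence the extra parameter pays for itself by shortening the encoding of the data, while for the balanced sequence it only adds to the description, so the simpler model is preferred.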

Questions related to the philosophy of information have led us naturally back 
to some of the profound debates in physics on the nature of the concept of entropy 
as it appears in the description of systems about which we have a priori only lim- 
ited information. The Gibbs paradox, for example, centers around the question of 
whether entropy is subjective or objective. We have seen that while the description 
might have subjective components, whenever we use the concept of entropy to ask 
concrete physical questions, we always get objective physical answers. Similarly, 
when we inject intelligent actors into the story, as for Maxwell's demon, we see that 
the second law remains valid - it applies equally well in a universe with sentient 
beings. 

Fundamental turning points in physics have always left important traces in infor- 
mation theory. A particularly interesting example is the development of quantum 
information theory, with its envisaged applications to quantum security, quantum 
teleportation and quantum computation. Another interesting example is the black 
hole information paradox, where the notions of entropy and information continue 
to be central players in our attempts to resolve some of the principal debates of 
modern theoretical physics. In a sense, our ability to construct a proper statistical 
mechanics is a good test of our theories. If we could only formulate an underlying 
statistical mechanics of black holes, we might be able to resolve fundamental 
questions about the interface between gravity and quantum mechanics. 

It is indicative that a long-standing bet between Hawking and Preskill of Caltech was settled in 
2004 when Hawking officially declared defeat. In doing so he recognized the fact that information 
is not lost when we throw something into a black hole - quantum correlations between the 
infalling matter and the outgoing radiation should in principle make it possible to retrieve the 
original information. 

Finally, as we enter the realm of nonequilibrium statistical mechanics, we see 
that the question of what information means and how it can be used remains vital. 
New entropies are being defined, and their usefulness and theoretical consistency 
are topics that are actively debated. The physics of information is an emerging 
field, one that is still very much in progress. 